Standard deviation of generic list? [duplicate]

Standard deviation of generic list? [duplicate] - c#

This question already has answers here:
How do I determine the standard deviation (stddev) of a set of values?
(12 answers)
Standard Deviation in LINQ
(8 answers)
Closed 9 years ago.
I need to calculate the standard deviation of a generic list. I will try to include my code. Its a generic list with data in it. The data is mostly floats and ints. Here is my code that is relative to it without getting into to much detail:
namespace ValveTesterInterface
{
public class ValveDataResults
{
private List<ValveData> m_ValveResults;
public ValveDataResults()
{
if (m_ValveResults == null)
{
m_ValveResults = new List<ValveData>();
}
}
public void AddValveData(ValveData valve)
{
m_ValveResults.Add(valve);
}
Here is the function where the standard deviation needs to be calculated:
public float LatchStdev()
{
float sumOfSqrs = 0;
float meanValue = 0;
foreach (ValveData value in m_ValveResults)
{
meanValue += value.LatchTime;
}
meanValue = (meanValue / m_ValveResults.Count) * 0.02f;
for (int i = 0; i <= m_ValveResults.Count; i++)
{
sumOfSqrs += Math.Pow((m_ValveResults - meanValue), 2);
}
return Math.Sqrt(sumOfSqrs /(m_ValveResults.Count - 1));
}
}
}
Ignore whats inside the LatchStdev() function because I'm sure its not right. Its just my poor attempt to calculate the st dev. I know how to do it of a list of doubles, however not of a list of generic data list. If someone had experience in this, please help.

The example above is slightly incorrect and could have a divide by zero error if your population set is 1. The following code is somewhat simpler and gives the "population standard deviation" result. (http://en.wikipedia.org/wiki/Standard_deviation)
using System;
using System.Linq;
using System.Collections.Generic;
public static class Extend
{
public static double StandardDeviation(this IEnumerable<double> values)
{
double avg = values.Average();
return Math.Sqrt(values.Average(v=>Math.Pow(v-avg,2)));
}
}

This article should help you. It creates a function that computes the deviation of a sequence of double values. All you have to do is supply a sequence of appropriate data elements.
The resulting function is:
private double CalculateStandardDeviation(IEnumerable<double> values)
{
double standardDeviation = 0;
if (values.Any())
{
// Compute the average.
double avg = values.Average();
// Perform the Sum of (value-avg)_2_2.
double sum = values.Sum(d => Math.Pow(d - avg, 2));
// Put it all together.
standardDeviation = Math.Sqrt((sum) / (values.Count()-1));
}
return standardDeviation;
}
This is easy enough to adapt for any generic type, so long as we provide a selector for the value being computed. LINQ is great for that, the Select funciton allows you to project from your generic list of custom types a sequence of numeric values for which to compute the standard deviation:
List<ValveData> list = ...
var result = list.Select( v => (double)v.SomeField )
.CalculateStdDev();

Even though the accepted answer seems mathematically correct, it is wrong from the programming perspective - it enumerates the same sequence 4 times. This might be ok if the underlying object is a list or an array, but if the input is a filtered/aggregated/etc linq expression, or if the data is coming directly from the database or network stream, this would cause much lower performance.
I would highly recommend not to reinvent the wheel and use one of the better open source math libraries Math.NET. We have been using that lib in our company and are very happy with the performance.
PM> Install-Package MathNet.Numerics
var populationStdDev = new List<double>(1d, 2d, 3d, 4d, 5d).PopulationStandardDeviation();
var sampleStdDev = new List<double>(2d, 3d, 4d).StandardDeviation();
See http://numerics.mathdotnet.com/docs/DescriptiveStatistics.html for more information.
Lastly, for those who want to get the fastest possible result and sacrifice some precision, read "one-pass" algorithm https://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods

I see what you're doing, and I use something similar. It seems to me you're not going far enough. I tend to encapsulate all data processing into a single class, that way I can cache the values that are calculated until the list changes.
for instance:
public class StatProcessor{
private list<double> _data; //this holds the current data
private _avg; //we cache average here
private _avgValid; //a flag to say weather we need to calculate the average or not
private _calcAvg(); //calculate the average of the list and cache in _avg, and set _avgValid
public double average{
get{
if(!_avgValid) //if we dont HAVE to calculate the average, skip it
_calcAvg(); //if we do, go ahead, cache it, then set the flag.
return _avg; //now _avg is garunteed to be good, so return it.
}
}
...more stuff
Add(){
//add stuff to the list here, and reset the flag
}
}
You'll notice that using this method, only the first request for average actually computes the average. After that, as long as we don't add (or remove, or modify at all, but those arnt shown) anything from the list, we can get the average for basically nothing.
Additionally, since the average is used in the algorithm for the standard deviation, computing the standard deviation first will give us the average for free, and computing the average first will give us a little performance boost in the standard devation calculation, assuming we remember to check the flag.
Furthermore! places like the average function, where you're looping through every value already anyway, is a great time to cache things like the minimum and maximum values. Of course, requests for this information need to first check whether theyve been cached, and that can cause a relative slowdown compared to just finding the max using the list, since it does all the extra work setting up all the concerned caches, not just the one your accessing.

Related

.NET equivalent of Java's TreeSet.floor & TreeSet.ceiling

As an example, there's a Binary Search Tree which holds a range of values. Before adding a new value, I need to check if it already contains it's 'almost duplicate'. I have Java solution which simply performs floor and ceiling and further condition to do the job.
JAVA: Given a TreeSet, floor() returns the greatest element in this set less than or equal to the given element; ceiling() returns the least element in this set greater than or equal to the given element
TreeSet<Long> set = new TreeSet<>();
long l = (long)1; // anything
Long floor = set.floor(l);
Long ceil = set.ceiling(l);
C#: Closest data structure seems to be SortedSet<>. Could anyone advise the best way to get floor and ceil results for an input value?
SortedSet<long> set = new SortedSet<long>();

The above, as mentioned, is not the answer since this is a tree we expect logarithmic times. Java's floor and ceiling methods are logarithmic. GetViewBetween is logarigmic and so are Max and Min, so:
floor for SortedSet<long>:
sortedSet.GetViewBetween(long.MinValue, num).Max
ceiling for SortedSet<long>:
sortedSet.GetViewBetween(num, long.MaxValue).Min

You can use something like this. In Linq there is LastOrDefault method:
var floor = sortedSet.LastOrDefault(i => i < num);
// num is the number whose floor is to be calculated
if (! (floor < sortedSet.ElementAt(0)))
{
// we have a floor
}
else
// nothing is smaller in the set
{
}

Optimization Algorithm in C#

I have an optimization issue that I'm not sure where to go from here. I have a program that tries to find the best combination of inputs that return the highest predicted r squared value. The problem is that I have 21 total inputs (List) and I need them in a set of 15 inputs. The formula for total combinations is:
n! / r!(n - r)! = 21! / 15!(21 - 15)! = 54,264 possible combinations
So obviously running through each combination and calculating the predicted rsquared is not an ideal solution so is there an better way/algorithm/method I can use to try to skip or narrow down the bad combinations so that I'm only processing the fewest amount of combinations? Here is my current psuedo code for this issue:
public BestCombo GetBestCombo(List<List<MultipleRegressionInfo>> combosList)
{
BestCombo bestCombo = new BestCombo();
foreach (var combo in combosList)
{
var predRsquared = CalculatePredictedRSquared(combo);
if (predRsquared > bestCombo.predRSquared)
{
bestCombo.predRSquared = predRsquared;
bestCombo.BestRSquaredCombo = combo;
}
}
return bestCombo;
}
public class BestCombo
{
public double predRSquared { get; set; }
public IEnumerable<MultipleRegressionInfo> BestRSquaredCombo { get; set; }
}
public class MultipleRegressionInfo
{
public List<double> input { get; set; }
public List<double> output { get; set; }
}
public double CalculatePredictedRSquared(List<MultipleRegressionInfo> combo)
{
Matrix<double> matrix = BuildMatrix(combo.Select(i => i.input).ToArray());
Vector<double> vector = BuildVector(combo.ElementAt(0).output);
var coefficients = CalculateWithQR(matrix, vector);
var y = CalculateYIntercept(coefficients, input, output);
var estimateList = CalculateEstimates(coefficients, y, input, output);
return GetPredRsquared(estimateList, output);
}

54,264 is not enormous for a computer - it might be worth timing a few calls to compute R^2 and multiplying up to see just how long this would take.
There is a branch and bound algorithm for this sort of problem, which relies on the fact that R^2(A,B,C) >= R^2(A,B) - that the R^2 can only decrease when you drop a variable. Recursively search the space of all sets of variables of size at least 15. After computing the R^2 for a set of variables, make recursive calls with sets produced by dropping a single variable from the set, where any such drop must be to the right of any existing gap (so A.CDE produces A..DE, A.C.E, and A.CD. but not ..CDE, which will be produced by .BCDE). You can terminate the recursion when you get down to the desired size of set, or when you find an R^2 that is no better than the best answer so far.
If it happens that you often find R^2 values no better than the best answer so far, this will save time - but this is not guaranteed. You can attempt to improve the efficiency by chosing to investigate the sets with highest R^2 first, hoping that you find a new best answer good enough to rule out their siblings by the time you come to them, and by using a procedure to calculate R^2 for A.CDE that makes use of the calculations you have already done for ABCDE.

Output of code using System.Random does not approach theoretical limit as iterations increase

I'm not good at stats, so I tried to solve a simple problem in C#. The problem: "A given team has a 65% chance to win a single game against another team. What is the probability that they will win a best-of-5 set?"
I wanted to look at the relationship between that probability and the number of games in the set. How does a Bo3 compare to a Bo5, and so on?
I did this by creating Set and Game objects and running iterations. The win decision is done with this code:
Won = rnd.Next(1, 100) <= winChance;
rnd is, as you might expect, a static System.Random object.
Here's my Set object code:
public class Set
{
public int NumberOfGames { get; private set; }
public List<Game> Games { get; private set; }
public Set(int numberOfGames, int winChancePct)
{
NumberOfGames = numberOfGames;
GamesNeededToWin = Convert.ToInt32(Math.Ceiling(NumberOfGames / 2m));
Games = Enumerable.Range(1, numberOfGames)
.Select(i => new Game(winChancePct))
.ToList();
}
public int GamesNeededToWin { get; private set; }
public bool WonSet => Games.Count(g => g.Won) >= GamesNeededToWin;
}
My issue is that the results I get aren't quite what they should be. Someone who sucks less at stats did the math for me, and it seems my code is always overestimating the chance of winning the set, and the number of iterations doesn't improve the accuracy.
The results I get (% set win by games per set) are below. The first column is the games per set, the next is the statistical win rate (which my results should be approaching), and the remaining columns are my results based on the number of iterations. As you can see, more iterations don't seem to be making the numbers more accurate.
Games Per Set|Expected Set Win Rate|10K|100K|1M|10M
1 65.0% 66.0% 65.6% 65.7% 65.7%
3 71.8% 72.5% 72.7% 72.7% 72.7%
5 76.5% 78.6% 77.4% 77.5% 77.5%
7 80.0% 80.7% 81.2% 81.0% 81.1%
9 82.8% 84.1% 83.9% 83.9% 83.9%
The entire project is posted on github here if you want to look.
Any insight into why this isn't producing accurate results would be greatly appreciated.

The answer of Darren Sisson is correct; your computation is off by approximately 1%, and so all your results are as well.
My recommendation is that you solve the problem by encapsulating your desired semantics into an object which you can then test independently:
interface IDistribution<T>
{
T Sample();
}
static class Extensions
{
public static IEnumerable<T> Samples(this IDistribution<T> d)
{
while (true) yield return d.Sample();
}
}
class Bernoulli : IDistribution<bool>
{
// Note that we could also make it IDistribution<int> and return
// 0 and 1 instead of false and true; that would be the more
// "classic" approach to a Bernoulli distribution. Your choice.
private double d;
private Random random = new Random();
private Bernoulli(double d) { this.d = d; }
public static Make(double d) => new Bernoulli(d);
public bool Sample() => random.NextDouble() < d;
}
And now you have a biased coin flipper which you can test independently. You can now write code like:
int flips = 1000;
int heads = Bernoulli
.Make(0.65)
.Samples()
.Take(flips)
.Where(x => x)
.Count();
to do 1000 coin flips with a 65% chance of heads.
Notice that what we are doing here is constructing a probability distribution monad and then using the tools of LINQ to express a conditional probability. This is a powerful technique; your application barely scratches the surface of what we can do with it.
Exercise: construct extension methods Where, Select and SelectMany which take not IEnumerable<T> but rather IDistribution<T>; can you express the semantics of the distribution in terms of the distribution type itself, rather than making a transformation from the distribution monad to the sequence monad? Can you do the same for zip joins?
Exercise: construct other implementations of IDistribution<T>. Can you construct, say, a Cauchy distribution of doubles? What about a normal distribution? What about a dice-rolling distribution on a fair die of n sides? Now, can you put this all together? What is the distribution which is: flip a coin; if heads, roll four dice and add them together, otherwise roll two dice and discard all the doubles, and multiply the results.

quick look, The Random function's upper bound is exclusive so would need to be set to 101

Pitfalls in C# for a new user. (FWHM calculation)

This is my idea to program a simple math module (function) that can be called from another main program. It calculates the FWHM(full width at half the max) of a curve. Since this is my first try at Visual Studio and C#. I would like to know few basic programming structures I should learn in C# coming from a Mathematica background.
Is double fwhm(double[] data, int c) indicate the input arguments
to this function fwhm should be a double data array and an Integer
value? Did I get this right?
I find it difficult to express complex mathematical equations (line 32/33) to express them in parenthesis and divide one by another, whats the right method to do that?
How can I perform Mathematical functions on elements of an Array like division and store the results in the same Array?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DEV_2
{
class fwhm
{
static double fwhm(double[] data, int c) // data as 2d data and c is integer
{
double[] datax;
double[] datay;
int L;
int Mag = 4;
double PP = 2.2;
int CI;
int k;
double Interp;
double Tlead;
double Ttrail;
double fwhm;
L = datay.Length;
// Create datax as index for the number of elemts in data from 1-Length(data).
for (int i = 1; i <= data.Length; i++)
{
datax[i] = (i + 1);
}
//Find max in datay and divide all elements by maxValue.
var m = datay.Length; // Find length of datay
Array.ForEach(datay, (x) => {datay[m++] = x / datay.Max();}); // Divide all elements of datay by max(datay)
double maxValue = datay.Max();
CI = datay.ToList().IndexOf(maxValue); // Push that index to CI
// Start to search lead
int k = 2;
while (Math.Sign(datay[k]) == Math.Sign(datay[k-1]-0.5))
{
k=k+1;
}
Interp = (0.5-datay[k-1])/(datay[k]-datay[k-1]);
Tlead = datax[k-1]+Interp*(datax[k]-datax[k-1]);
CI = CI+1;
// Start search for the trail
while (Math.Sign(datay[k]-0.5) == Math.Sign(datay[k-1]-0.5) && (k<=L-1))
{
k=k+1;
}
if (k != L)
{
Interp = (0.5-datay[k-1])/(datay[k]-datay[k-1]);
Ttrail = datax[k-1] + Interp*(datax[k]-datax[k-1]);
fwhm =((Ttrail-Tlead)*PP)/Mag;
}
}//end main
}//end class
}//end namespace

There are plenty of pitfalls in C#, but working through problems is a great way to find and learn them!
Yes, when passing parameters to a method the correct syntax is MethodName(varType varName) seperated by a comma for multiple parameters. Some pitfalls arise here with differences in passing Value types and Reference types. If you're interested here is some reading on the subject.
Edit: As pointed out in the comments you should write code as best as possible to require as few comments as possible (thus paragraph between #3 and #4), however if you need to do very specific and slightly complex math then you should comment to clarify what is occuring.
If you mean difficulties understanding, make sure you comment your code properly. If you mean difficulties writing it, you can create variables to simplify reading your code (but generally unnecessary) or look up functions or libraries to help you, this is a bit open ended question if you have a particular functionality you are looking for perhaps we could be of more help.
You can access your array via indexes such as array[i] will get the ith index. Following this you can manipulate the data that said index is pointing to in any way you wish, array[i] = (array[i]/24)^3 or array[i] = doMath(array[i])
A couple things you can do if you like to clean a little, but they are preference based, is not declare int CI; int k; in your code before you initialize them with int k = 2;, there is no need (although you can if it helps you). The other thing is to correctly name your variables, common practice is a more descriptive camelCase naming, so perhaps instead of int CI = datay.ToList().IndexOf(maxValue); you coud use int indexMaxValueYData = datay.ToList().IndexOf(maxValue);
As per your comment question "What would this method return?" The method will return a double, as declared above. returnType methodName(parameters) However you need to add that in your code, as of now I see no return line. Such as return doubleVar; where doubleVar is a variable of type double.

C# getting first digit of int in custom class

I am trying to build a help function in my guess the number game, whereby the user gets the first digit of the number he/she has to guess. So if the generated number is 550, he will get the 5.
I have tried a lot of things, maybe one of you has an idea what is wrong?
public partial class Class3
{
public Class3()
{
double test = Convert.ToDouble(globalVariableNumber.number);
while (test > 10)
{
double firstDigit = test / 10;
test = Math.Round(test);
globalVariableNumber.helpMe = Convert.ToString(firstDigit);
}
}
}
Under the helpButton clicked I have:
private void helpButton_Click(object sender, EventArgs e)
{
label3.Text = globalVariableNumber.helpMe;
label3.AutoSize = true;
That is my latest try, I putted all of this in a custom class. In the main I putted the code to show what is in the helpMe string.
If you need more code please tell me

Why not ToString the number and use Substring to get the first character?
var number = 550;
var result = number.ToString().Substring(0, 1);
If for some reason you dont want to use string manipulation you could do this mathematically like this
var number = 550;
var result = Math.Floor(number / Math.Pow(10, Math.Floor(Math.Log10(number))));

What's wrong - you have an infinite while loop there. Math.Round(test) will leave the value of test unchanged after the first iteration.
You may have intended to use firstDigit as the variable controlling the loop.
Anyway, as suggested by others, you can set helpMe to the first digit by converting to a string and using the first character.
As an aside, you should consider supplying the number as a parameter and returning the helpMe string from the method. Your current approach is a little brittle.

The problem with your code is that you are doing the division and storing that in a separate variable, then you round the original value. That means that the original value only changes in the first iteration of the loop (and is only rounded, not divided), and unless that happens to make the loop condition false (i.e. for values between 10 and 10.5), the loop will never end.
Changes:
Use an int intead of a double, that gets you away from a whole bunch of potential precision problems.
Use the >= operator rather than >. If you get the value 10 then you want the loop to go on for another iteration to get a single digit.
You would use Math.Floor instead of Math.Round as you don't want the first digit to be rounded up, i.e. getting the first digit for 460 as 5. However, if you are using an integer then the division truncates the result, so there is no need to do any rounding at all.
Divide the value and store it back into the same variable.
Use the value after the loop, there is no point in updating it while you still have multiple digits in the variable.
Code:
int test = (int)globalVariableNumber.number;
while (test >= 10) {
test = test / 10;
}
globalVariableNumber.helpMe = test.ToString();

By using Math.Round(), in your example, you're rounding 5.5 to 6 (it's the even integer per the documentation). Use Math.Floor instead, this will drop the decimal point but give you the number you're expecting for this test.
i.e.
double test = Convert.ToDouble(globalVariableNumber.number);
while (test > 10)
{
test = Math.Floor(test / 10);
globalVariableNumber.helpMe = Convert.ToString(firstDigit);
}
Like #Sam Greenhalgh mentions, though, returning the first character of the number as a string will be cleaner, quicker and easier.
globalVariableNumber.helpMe = test >= 10
? test.ToString().SubString(0, 1)
: "Hint not possible, number is less than ten"
This assumes that helpMe is a string.
Per our discussion in the comments, you'd be better off doing it like this:
private void helpButton_Click(object sender, EventArgs e)
{
label3.Text = GetHelpText();
label3.AutoSize = true;
}
// Always good practice to name a method that returns something Get...
// Also good practice to give it a descriptive name.
private string GetHelpText()
{
return test >= 10 // The ?: operator just means if the first part is true...
? test.ToString().SubString(0, 1) // use this, otherwise...
: "Hint not possible, number is less than ten" // use this.
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Standard deviation of generic list? [duplicate] - c#

Related

.NET equivalent of Java's TreeSet.floor & TreeSet.ceiling

Optimization Algorithm in C#

Output of code using System.Random does not approach theoretical limit as iterations increase

Pitfalls in C# for a new user. (FWHM calculation)

C# getting first digit of int in custom class

Categories

Resources