I am writing a program that finds the percentile of a score. According to eHow:
Start to calculate the percentile of your test score (as an example we’ll stick with your score of 87). The formula to use is L/N(100) = P where L is the number of tests with scores less than 87, N is the total number of test scores (here 150) and P is the percentile. Count up the total number of test scores that are less than 87. We’ll assume the number is 113. This gives us L = 113 and N = 150.
And so, according to the instructions, I wrote:
string[] n = Interaction.InputBox("Enter the data set. The numbers do not have to be sorted.").Split(',');
List<Single> x = new List<Single> { };
foreach (string i in n)
{
x.Add(Single.Parse(i));
}
x.Sort();
List<double> lowerThan = new List<double> { };
Single score = Single.Parse(Interaction.InputBox("Enter the number."));
uint length = (uint)x.Count;
foreach (Single index in x)
{
if (index > score)
{
lowerThan.Add(index);
}
}
uint lowerThanCount = (uint)lowerThan.Count();
double percentile = lowerThanCount / length * 100;
MessageBox.Show("" + percentile);
Yet the program always returns 0 as the percentile! What errors have I made?
Your calculation
double percentile = lowerThanCount / length * 100;
is done entirely in integer arithmetic, since the right hand side consists only of integers. At least one of the operands should be of a floating point type. So
double percentile = (float) lowerThanCount / length * 100;
This is effectively a truncation problem: lowerThanCount and length are both uint, so the division cannot produce decimal places, and any quotient below 1 (e.g. 10 / 20) results in 0.
For example, if we assume lowerThanCount = 10 and length = 20, the calculation would look something like
double result = (10 / 20) * 100
The division would mathematically give
(10 / 20) = 0.5
As 0.5 cannot be represented as an integer, the fractional part is truncated, which leaves you with 0, so the final calculation becomes
0 * 100 = 0;
You can fix this by forcing the calculation to work with a floating point type instead, e.g.
double percentile = (double)lowerThanCount / length * 100;
In terms of readability, it probably makes more sense to put the cast in the calculation, given that lowerThanCount and length will never naturally be floating point numbers.
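As a minimal sketch of the difference (the class and method names below are hypothetical, not from the question), these two helpers perform the same calculation with and without the cast:

```csharp
using System;

public static class PercentileDemo
{
    // Integer division happens first, so any ratio below 1 truncates to 0.
    public static double Truncated(uint lowerThanCount, uint length)
    {
        return lowerThanCount / length * 100;
    }

    // Casting one operand to double makes the whole expression floating point.
    public static double Correct(uint lowerThanCount, uint length)
    {
        return (double)lowerThanCount / length * 100;
    }
}
```

With the numbers from the eHow example, Truncated(113, 150) yields 0, while Correct(113, 150) yields roughly 75.33.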
Also, your code could be simplified a lot using LINQ:
string[] n = Interaction.InputBox("Enter the data set. The numbers do not have to be sorted.")
.Split(',');
// Parse, sort and materialize the values so that Count is available.
List<Single> x = n.Select(s => Single.Parse(s))
.OrderBy(v => v)
.ToList();
Single score = Single.Parse(Interaction.InputBox("Enter the number."));
int lowerThanCount = x.Count(v => v < score);
Single percentile = (Single)lowerThanCount / x.Count;
// "P" is the standard percent format: it multiplies by 100 and appends a percent sign.
MessageBox.Show(percentile.ToString("P"));
The problem is in the types that you used for your variables: in this expression
double percentile = lowerThanCount / length * 100;
//                  ^^^^^^^^^^^^^^^^^^^^^^^
//                  This is integer division; since length > lowerThanCount, its result is zero
the division is done on integers, so the result is going to be zero.
Change the type of lowerThanCount to double to fix this problem:
double lowerThanCount = (double)lowerThan.Count();
You are using integer division instead of floating point division. Cast length or lowerThanCount to a floating point type before dividing.
Besides the percentile calculation (should be with floats), I think your count is off here:
foreach (Single index in x)
{
if (index > score)
{
lowerThan.Add(index);
}
}
You go through the values and, if they are larger than score, you put them into lowerThan. To collect the scores that are lower than score, the comparison should be index < score.
Just a logical mistake?
EDIT: for the percentile problem, here is my fix:
double percentile = ((double)lowerThanCount / (double)length) * 100.0;
You might not need all the (double)'s there, but just to be safe...
I used to think I understand the difference between decimal and double values, but now I'm not able to justify the behavior of this code snippet.
I need to divide the difference between two decimal numbers in some intervals, for example:
decimal minimum = 0.158m;
decimal maximum = 64.0m;
decimal delta = (maximum - minimum) / 6; // 10.640333333333333333333333333
Then I create the intervals in reverse order, but the first result is already unexpected:
for (int i = 5; i >= 0; i--)
{
Interval interval = new Interval(minimum + (delta * i), minimum + (delta * (i + 1)));
}
{53.359666666666666666666666665, 63.999999999999999999999999998}
I would expect the maximum value to be exactly 64. What am I missing here?
Thank you very much!
EDIT: if I use double instead of decimal it seems to work properly!
You're not missing anything. This is the result of rounding the numbers multiple times internally, i.e. compounding loss of precision. The delta, to begin with, isn't exactly 10.640333333333333333333333333, but the 3s keep repeating endlessly, resulting in a loss of precision when you multiply or divide using this decimal.
Maybe you could do it like this instead:
for (decimal i = maximum; i >= delta; i -= delta)
{
Interval interval = new Interval(i - delta, i);
}
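Another way to limit the compounding error, sketched here on the assumption that each bound can be computed independently, is to multiply before dividing so that the repeating quotient is never stored and reused. The class and method names are hypothetical:

```csharp
using System;

public static class IntervalBounds
{
    // Returns the k-th boundary (0..parts) between minimum and maximum.
    // Multiplying by k before dividing by parts avoids reusing a rounded delta.
    public static decimal Bound(decimal minimum, decimal maximum, int parts, int k)
    {
        return minimum + (maximum - minimum) * k / parts;
    }
}
```

For the values in the question, Bound(0.158m, 64.0m, 6, 6) gives exactly 64, because (64.0 - 0.158) * 6 / 6 involves no repeating fraction. Intermediate boundaries such as k = 1 are still rounded to the precision of decimal.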
Double has about 15-17 significant digits of precision, while decimal has 28-29. Thus double is more likely than decimal to round the result off to exactly 64.
I am trying to calculate average for an array of floats. I need to use indices because this is inside a binary search so the top and bottom will move. (Big picture we are trying to optimize a half range estimation so we don't have to re-create the array each pass).
Anyway I wrote a custom average loop and I'm getting 2 places less accuracy than the c# Average() method
float test = input.Average();
int count = (top - bottom) + 1;//number of elements in this iteration
int pos = bottom;
float average = 0f;//working average
while (pos <= top)
{
average += input[pos];
pos++;
}
average = average / count;
example:
0.0371166766 - c#
0.03711666 - my loop
125090.148 - c#
125090.281 - my loop
http://pastebin.com/qRE3VrCt
I'm getting 2 places less accuracy than the c# Average()
No, you are only losing 1 significant digit. The float type can only store 7 significant digits, the rest are just random noise. Inevitably in a calculation like this, you can accumulate round-off error and thus lose precision. Getting the round-off errors to balance out requires luck.
The only way to avoid it is to use a floating point type that has more precision to accumulate the result. Not an issue, you have double available. Which is why the Linq Average method looks like this:
public static float Average(this IEnumerable<float> source) {
if (source == null) throw Error.ArgumentNull("source");
double sum = 0; // <=== NOTE: double
long count = 0;
checked {
foreach (float v in source) {
sum += v;
count++;
}
}
if (count > 0) return (float)(sum / count);
throw Error.NoElements();
}
Use double to reproduce the Linq result with a comparable number of significant digits in the result.
I'd rewrite this as:
int count = (top - bottom) + 1;//number of elements in this iteration
double sum = 0;
for(int i = bottom; i <= top; i++)
{
sum += input[i];
}
float average = (float)(sum/count);
That way you're using a high precision accumulator, which helps reduce rounding errors.
btw. if performance isn't that important, you can still use LINQ to calculate the average of an array slice:
input.Skip(bottom).Take(top - bottom + 1).Average()
I'm not entirely sure if that fits your problem, but if you need to calculate the average of many subarrays, it can be useful to create a prefix sum (cumulative sum) array, so that calculating an average simply becomes two table lookups and a division.
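A minimal sketch of that cumulative sum idea follows. The class and method names are hypothetical, and it assumes the input array does not change between queries:

```csharp
using System;

public static class PrefixAverage
{
    // prefix[i] holds the sum of input[0..i-1], accumulated in double.
    public static double[] BuildPrefixSums(float[] input)
    {
        double[] prefix = new double[input.Length + 1];
        for (int i = 0; i < input.Length; i++)
        {
            prefix[i + 1] = prefix[i] + input[i];
        }
        return prefix;
    }

    // Average of input[bottom..top], both indices inclusive.
    public static float Average(double[] prefix, int bottom, int top)
    {
        int count = top - bottom + 1;
        return (float)((prefix[top + 1] - prefix[bottom]) / count);
    }
}
```

The table is built once in O(n) and each slice average is then O(1). For very long arrays, subtracting two large cumulative sums can itself cost some precision, so this is a trade-off rather than a free win.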
Just to add to the conversation, be careful when using Floating point primitives.
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Internally floating point numbers store additional least significant bits that are not reflected in the displayed value (aka: Guard Bits or Guard Digits). They are, however, utilized when performing mathematical operations and equality checks. One common result is that a variable containing 0f is not always zero. When accumulating floating point values this can also lead to precision errors.
Use Decimal for your accumulator:
Will not have rounding errors due to Guard Digits
Is a 128bit data type (less likely to exceed Max Value in your accumulator).
For more info:
What is the difference between Decimal, Float and Double in C#?
I've used this formula for getting a random double in a custom interval:
Random r = new Random();
double Lower = 3.7, Upper = 11.4, Result;
Result = Lower + (r.NextDouble() * (Upper - Lower));
// Lower is the lower border of interval, Upper is the upper border of interval
But keep in mind what MSDN says about NextDouble method:
A double-precision floating point number greater than or equal to 0.0, and less than 1.0.
That means interval in my sample code would include 3.7, but we can never get 11.4, right?
How can I include the upper border?
Lower + (r.NextDouble() * (Upper - Lower + double.Epsilon))
Can this formula help? Or is there another way of getting random double numbers in [3.7 ; 11.4] (including both borders)?
Do you really need the upper interval for the double case? The odds of hitting exactly that value are really, really small, and should be statistically insignificant for almost all scenarios. If you're interested in numbers with a certain number of decimal places, then you can use some rounding to achieve what you need.
Since you're using doubles, what kind of precision do you actually need? Rounding the numbers might be enough. Alternatively you can use your own scaling like this:
static void Main(string[] args)
{
var r = new Random(3);
for (int i = 0; i < 100; i++)
{
Console.WriteLine(r.NextDouble(0, 1, 100));
}
Console.ReadKey();
}
public static double NextDouble(this Random r
, double lower
, double upper
, int scale = int.MaxValue - 1
)
{
var d = lower + ((r.Next(scale + 1)) * (upper - lower) / scale);
return d;
}
That will give you the lower and upper inclusive range at the specified scale. I threw in a default value for scale which gives you the highest possible precision, using this method.
The precision itself is a problem here, since neither 3.7 nor 11.4 has an exact double representation.
Since you are using random double precision numbers, I don't think this imprecision is something to care about.
Add the following
Result = Math.Round(Result, 8);
and voila.
The number 8 is the decimal places it will round to. When the random number is within 8 decimal places of the upper bound (example: 11.3999999990) then the result will round to the bound (answer: 11.4000000000).
Of course the rounding applies to all the numbers, so choose your precision carefully. Whether 8 decimal places is good enough really depends on the application. Math.Round accepts 0 to 15 decimal places.
.NET Framework 3.5.
I'm trying to calculate the average of some pretty large numbers.
For instance:
using System;
using System.Linq;
class Program
{
static void Main(string[] args)
{
var items = new long[]
{
long.MaxValue - 100,
long.MaxValue - 200,
long.MaxValue - 300
};
try
{
var avg = items.Average();
Console.WriteLine(avg);
}
catch (OverflowException ex)
{
Console.WriteLine("can't calculate that!");
}
Console.ReadLine();
}
}
Obviously, the mathematical result is 9223372036854775607 (long.MaxValue - 200), but I get an exception there. This is because the implementation (on my machine) of the Average extension method, as inspected by .NET Reflector, is:
public static double Average(this IEnumerable<long> source)
{
if (source == null)
{
throw Error.ArgumentNull("source");
}
long num = 0L;
long num2 = 0L;
foreach (long num3 in source)
{
num += num3;
num2 += 1L;
}
if (num2 <= 0L)
{
throw Error.NoElements();
}
return (((double) num) / ((double) num2));
}
I know I can use a BigInt library (yes, I know that it is included in .NET Framework 4.0, but I'm tied to 3.5).
But I still wonder if there's a pretty straight forward implementation of calculating the average of integers without an external library. Do you happen to know about such implementation?
Thanks!!
UPDATE:
The previous example, of three large integers, was just an example to illustrate the overflow issue. The question is about calculating an average of any set of numbers which might sum to a large number that exceeds the type's max value. Sorry about this confusion. I also changed the question's title to avoid additional confusion.
Thanks all!!
This answer used to suggest storing the quotient and remainder (mod count) separately. That solution is less space-efficient and more code-complex.
In order to accurately compute the average, you must keep track of the total. There is no way around this, unless you're willing to sacrifice accuracy. You can try to store the total in fancy ways, but ultimately you must be tracking it if the algorithm is correct.
For single-pass algorithms, this is easy to prove. Suppose you can't reconstruct the total of all preceding items, given the algorithm's entire state after processing those items. But wait, we can simulate the algorithm then receiving a series of 0 items until we finish off the sequence. Then we can multiply the result by the count and get the total. Contradiction. Therefore a single-pass algorithm must be tracking the total in some sense.
Therefore the simplest correct algorithm will just sum up the items and divide by the count. All you have to do is pick an integer type with enough space to store the total. Using a BigInteger guarantees no issues, so I suggest using that.
BigInteger total = BigInteger.Zero; // System.Numerics, .NET 4.0 or later
long count = 0;
foreach (long i in values)
{
count += 1;
total += i;
}
return (double)total / count; // warning: possible loss of accuracy, maybe return a Rational instead?
If you're just looking for an arithmetic mean, you can perform the calculation like this:
public static double Mean(this IEnumerable<long> source)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
double count = (double)source.Count();
double mean = 0D;
foreach(long x in source)
{
mean += (double)x/count;
}
return mean;
}
Edit:
In response to comments, there definitely is a loss of precision this way, due to performing numerous divisions and additions. For the values indicated by the question, this should not be a problem, but it should be a consideration.
You may try the following approach:
Let the number of elements be N, and the numbers be arr[0], ..., arr[N-1].
You need to define two variables: mean and remainder.
Initially mean = 0 and remainder = 0.
At step i, update mean and remainder in the following way:
mean += arr[i] / N;
remainder += arr[i] % N;
mean += remainder / N;
remainder %= N;
After N steps you will have the integer part of the answer in mean, and remainder / N will be the fractional part of the answer (I am not sure you need it, but anyway).
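The steps above can be sketched in C# as follows. The class and method names are hypothetical, and the sketch assumes a non-empty array:

```csharp
using System;

public static class SafeAverage
{
    // Returns the integer part of the mean; 'remainder' receives the numerator
    // of the fractional part (the fraction is remainder / values.Length).
    public static long MeanWithRemainder(long[] values, out long remainder)
    {
        long n = values.Length;
        long mean = 0;
        remainder = 0;
        foreach (long v in values)
        {
            mean += v / n;          // whole part contributed by this element
            remainder += v % n;     // leftover, always smaller than n in magnitude
            mean += remainder / n;  // carry whole units out of the leftover
            remainder %= n;
        }
        return mean;
    }
}
```

Because each element is divided by N before being added, the running total never grows beyond the magnitude of the final mean, so the three values from the question average to long.MaxValue - 200 without overflow.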
If you know approximately what the average will be (or, at least, that all pairs of numbers will have a max difference < long.MaxValue), you can calculate the average difference from that value instead. I take an example with low numbers, but it works equally well with large ones.
// Let's say numbers cannot exceed 40.
List<int> numbers = new List<int>() { 31, 28, 24, 32, 36, 29 }; // Average: 30
List<int> diffs = new List<int>();
// This can probably be done more effectively in linq, but to show the idea:
foreach(int number in numbers.Skip(1))
{
diffs.Add(number - numbers.First());
}
// diffs now contains { -3, -7, 1, 5, -2 }
// Divide by the full count: the first number's diff to itself is 0.
var avgDiff = diffs.Sum() / numbers.Count; // -6 / 6, so the average diff is -1
// To get the average value, just add the average diff to the first value:
var totalAverage = numbers.First()+avgDiff;
You can of course implement this in some way that makes it easier to reuse, for example as an extension method to IEnumerable<long>.
Here is how I would do it if given this problem. First let's define a very simple RationalNumber class, which contains two properties, Dividend and Divisor, and an operator for adding two rational numbers. Here is how it looks:
public sealed class RationalNumber
{
public RationalNumber()
{
this.Divisor = 1;
}
public static RationalNumber operator +( RationalNumber c1, RationalNumber c2 )
{
RationalNumber result = new RationalNumber();
Int64 nDividend = ( c1.Dividend * c2.Divisor ) + ( c2.Dividend * c1.Divisor );
Int64 nDivisor = c1.Divisor * c2.Divisor;
Int64 nReminder = nDividend % nDivisor;
if ( nReminder == 0 )
{
// The number is whole
result.Dividend = nDividend / nDivisor;
}
else
{
Int64 nGreatestCommonDivisor = FindGreatestCommonDivisor( nDividend, nDivisor );
if ( nGreatestCommonDivisor != 0 )
{
nDividend = nDividend / nGreatestCommonDivisor;
nDivisor = nDivisor / nGreatestCommonDivisor;
}
result.Dividend = nDividend;
result.Divisor = nDivisor;
}
return result;
}
private static Int64 FindGreatestCommonDivisor( Int64 a, Int64 b)
{
Int64 nRemainder;
while ( b != 0 )
{
nRemainder = a % b;
a = b;
b = nRemainder;
}
return a;
}
// a / b: a is the dividend, b is the divisor
public Int64 Dividend { get; set; }
public Int64 Divisor { get; set; }
}
The second part is really easy. Let's say we have an array of numbers. Their average is Sum(Numbers)/Length(Numbers), which is the same as Number[ 0 ] / Length + Number[ 1 ] / Length + ... + Number[ n ] / Length. To be able to calculate this, we will represent each Number[ i ] / Length as a whole number and a rational part (the remainder). Here is how it looks:
Int64[] aValues = new Int64[] { long.MaxValue - 100, long.MaxValue - 200, long.MaxValue - 300 };
List<RationalNumber> list = new List<RationalNumber>();
Int64 nAverage = 0;
for ( Int32 i = 0; i < aValues.Length; ++i )
{
Int64 nReminder = aValues[ i ] % aValues.Length;
Int64 nWhole = aValues[ i ] / aValues.Length;
nAverage += nWhole;
if ( nReminder != 0 )
{
list.Add( new RationalNumber() { Dividend = nReminder, Divisor = aValues.Length } );
}
}
RationalNumber rationalTotal = new RationalNumber();
foreach ( var rational in list )
{
rationalTotal += rational;
}
nAverage = nAverage + ( rationalTotal.Dividend / rationalTotal.Divisor );
At the end we have a list of rational numbers and a whole number, which we sum together to get the average of the sequence without an overflow. The same approach can be taken for any type without overflowing it, and there is no loss of precision.
EDIT:
Why this works:
Define: A is a set of numbers.
Average( A ) = SUM( A ) / LEN( A ) =>
Average( A ) = A[ 0 ] / LEN( A ) + A[ 1 ] / LEN( A ) + A[ 2 ] / LEN( A ) + ..... + A[ N ] / LEN( A ) =>
If we define An to be a number that satisfies An = X + ( Y / LEN( A ) ), which is always possible because dividing A by B gives a whole number X plus a remainder in the form of the rational number ( Y / B ),
=> so
Average( A ) = A1 + A2 + A3 + ... + AN = X1 + X2 + X3 + X4 + ... + Remainder1 + Remainder2 + ...;
Sum the whole parts, and sum the remainders by keeping them in rational number form. In the end we get one whole number and one rational, which summed together give Average( A ). Depending on what precision you'd like, you apply this only to the rational number at the end.
Simple answer with LINQ...
var data = new[] { int.MaxValue, int.MaxValue, int.MaxValue };
var mean = (int)data.Select(d => (double)d / data.Count()).Sum();
Depending on the size of the data set, you may want to force data with .ToList() or .ToArray() before you run this, so that it can't requery the count on each pass. (Or you can call Count() once before the .Select(..).Sum().)
If you know in advance that all your numbers are going to be 'big' (in the sense of 'much nearer long.MaxValue than zero'), you can calculate the average of their distances from long.MaxValue; the average of the numbers is then long.MaxValue less that.
However, this approach will fail if (m)any of the numbers are far from long.MaxValue, so it's horses for courses...
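A minimal sketch of this distance-based idea is below. The names are hypothetical, and it assumes a non-empty array of non-negative values close enough to long.MaxValue that the distances themselves sum without overflow; the checked block makes it throw rather than silently wrap if that assumption does not hold:

```csharp
using System;

public static class NearMaxAverage
{
    // Averages values by summing their distances from long.MaxValue.
    // The fractional part of the average distance is truncated.
    public static long Average(long[] values)
    {
        long distanceSum = 0;
        checked
        {
            foreach (long v in values)
            {
                distanceSum += long.MaxValue - v;
            }
        }
        return long.MaxValue - distanceSum / values.Length;
    }
}
```

For the three values in the question the distances are 100, 200 and 300, so the result is long.MaxValue - 200.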
I guess there has to be a compromise somewhere or the other. If the numbers are really getting so large then few digits of lower orders (say lower 5 digits) might not affect the result as much.
Another issue is where you don't really know the size of the dataset coming in, especially in stream/real time cases. Here I don't see any solution other than
(previousAverage*oldCount + newValue) / (oldCount <- oldCount+1)
Here's a suggestion:
*LargestDataTypePossible* currentAverage;
*SomeSuitableDatatypeSupportingRationalValues* newValue;
*int* count;
addToCurrentAverage(value){
newValue = value/100000;
count = count + 1;
currentAverage = (currentAverage * (count-1) + newValue) / count;
}
getCurrentAverage(){
return currentAverage * 100000;
}
Averaging numbers of a specific numeric type in a safe way while also only using that numeric type is actually possible, although I would advise using the help of BigInteger in a practical implementation. I created a project for Safe Numeric Calculations that has a small structure (Int32WithBoundedRollover) which can sum up to 2^32 int32s without any overflow (the structure internally uses two int32 fields to do this, so no larger data types are used).
Once you have this sum you then need to calculate sum/total to get the average, which you can do (although I wouldn't recommend it) by creating and then incrementing by total another instance of Int32WithBoundedRollover. After each increment you can compare it to the sum until you find out the integer part of the average. From there you can peel off the remainder and calculate the fractional part. There are likely some clever tricks to make this more efficient, but this basic strategy would certainly work without needing to resort to a bigger data type.
That being said, the current implementation isn't built for this (for instance there is no comparison operator on Int32WithBoundedRollover, although it wouldn't be too hard to add). The reason is that it is just much simpler to use BigInteger at the end to do the calculation. Performance-wise this doesn't matter too much for large averages since it will only be done once, and it is just too clean and easy to understand to worry about coming up with something clever (at least so far...).
As far as your original question which was concerned with the long data type, the Int32WithBoundedRollover could be converted to a LongWithBoundedRollover by just swapping int32 references for long references and it should work just the same. For Int32s I did notice a pretty big difference in performance (in case that is of interest). Compared to the BigInteger only method the method that I produced is around 80% faster for the large (as in total number of data points) samples that I was testing (the code for this is included in the unit tests for the Int32WithBoundedRollover class). This is likely mostly due to the difference between the int32 operations being done in hardware instead of software as the BigInteger operations are.
How about BigInteger in Visual J#.
If you're willing to sacrifice precision, you could do something like:
long num2 = 0L;
foreach (long num3 in source)
{
num2 += 1L;
}
if (num2 <= 0L)
{
throw Error.NoElements();
}
double average = 0;
foreach (long num3 in source)
{
average += (double)num3 / (double)num2;
}
return average;
Perhaps you can reduce every item first: calculate the average of the scaled-down values and then multiply it by the number of elements in the collection. However, this performs a somewhat different set of floating point operations, so the result may differ slightly.
var items = new long[] { long.MaxValue - 100, long.MaxValue - 200, long.MaxValue - 300 };
var avg = items.Average(i => i / items.Count()) * items.Count();
You could keep a rolling average which you update once for each large number.
Use the IntX library on CodePlex.
NextAverage = CurrentAverage + (NewValue - CurrentAverage) / (CurrentObservations + 1)
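The update rule above can be sketched as a single pass over the data. The names are hypothetical, the mean is kept in a double (so very large values lose their low-order digits), and an empty sequence simply returns 0:

```csharp
using System;
using System.Collections.Generic;

public static class RunningMean
{
    // Applies NextAverage = CurrentAverage + (NewValue - CurrentAverage) / (n + 1)
    // once per element, so no running total is ever stored.
    public static double Average(IEnumerable<long> values)
    {
        double mean = 0.0;
        long count = 0;
        foreach (long v in values)
        {
            count++;
            mean += (v - mean) / count;
        }
        return mean;
    }
}
```

Since only the current mean and the count are stored, this also works for streams whose length is not known in advance.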
Here is my version of an extension method that can help with this.
public static long Average(this IEnumerable<long> longs)
{
long mean = 0;
long count = longs.Count();
foreach (var val in longs)
{
mean += val / count;
}
return mean;
}
Let Avg(n) be the average of the first n numbers, and data[n] be the nth number.
Avg(n)=(double)(n-1)/(double)n*Avg(n-1)+(double)data[n]/(double)n
This avoids overflow, but it loses precision when n is very large.
For two positive numbers (or two negative numbers), I found a very elegant solution:
the average computation (a+b)/2 can be replaced with a+((b-a)/2).
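As a small sketch of that replacement (hypothetical names; it assumes both arguments have the same sign, so that b - a cannot overflow):

```csharp
using System;

public static class Midpoint
{
    // Equivalent to (a + b) / 2 for same-signed inputs, but a + b is never
    // formed, so two values near long.MaxValue do not overflow.
    public static long Of(long a, long b)
    {
        return a + (b - a) / 2;
    }
}
```

For example, Midpoint.Of(long.MaxValue - 4, long.MaxValue) returns long.MaxValue - 2, whereas computing (a + b) / 2 directly would overflow on the addition.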