I have a database table that needs to be converted into current form. This table has three columns that are of type Double (it's Pervasive.SQL, if anyone cares).
My problem is that this table has been around for a long time, and it's been acted upon by code going back some 15 years or better.
Historically, we have always used Double.MinValue (or whatever language equivalent at the time) to represent "blank" values provided by the user. The absence of a value, in other words, is actually stored as a value that we can recognize later and react to intelligently.
So, today my problem is that I need to loop through these records and insert them into a newly created table (this is the "conversion" I spoke of). However, I am not seeing consistent values in the tables I am converting. Here are the ones I know of for sure:
2.2250738585072014E-308
3.99285938963E-313
3.99099435427E-313
1.1125369292536007E-308
-5.389000690742776E279
2.104687961E-314
Now, I recognize that there are other ways that Double.MinValue might exist or at least be represented. Having done some google searches, I found that the first one is another representation of Double.MinValue (actually DBL_MIN referenced here: http://msdn.microsoft.com/en-us/library/6bs3y5ya(v=vs.100).aspx).
I don't want to get too long-winded, so I'll solicit questions if this is not enough information to help me. Suffice it to say, I need a reliable way of spotting all of the previous values of "minimum" and replacing them with the C# Double.MinValue constant as I loop through these data rows.
If it proves to be dataRow["Value"] < someConstant, then so be it. But I'll let the math theorists help me out with that determination.
Thank you for the time.
EDIT:
Here's what I am doing with these values as I find them. It's part of a generic method that assembles values to be written to the database:
else if (c.DataType == typeof(System.Double))
{
    if (inRow[c] == DBNull.Value)
        retString += @"NULL";
    else
    {
        Double d;
        if (Double.TryParse(inRow[c].ToString(), out d))
            retString += d.ToStringFull();
    }
}
Until now, it simply accepted them. And that's bad because when the application finds them, they look like acceptable data, and not like Double.MinValue. Therefore, not seen as blanks. But that's what they are.
This is utter craziness. Let's look at some of those numbers in detail. These are all tiny numbers just barely larger than zero:
2.2250738585072014E-308
This is 1 / 2^1022 -- it is a normal double. This is one of the two "special" numbers in your set; it is the smallest normal double that is larger than zero. The rest of the small doubles on your list are subnormal doubles.
1.1125369292536007E-308
This is 1 / 2^1023 -- it is a subnormal double. This is also a special number; it is half the smallest normal double larger than zero. (I originally said that it was the largest subnormal double but of course that is not true; see the comments.)
3.99285938963E-313
This isn't anything special. It's a subnormal double equal to a fraction where the numerator is 154145 and the denominator is a rather large power of two.
3.99099435427E-313
This isn't anything special either. This time the numerator is 154073.
2.104687961E-314
This isn't anything special either. The numerator is 2129967929 and the denominator is an even larger power of two.
All the numbers so far have been very close to zero and positive. This number is very far from zero and negative, and therefore stands out:
-5.389000690742776E279
But again it is nothing special; it is nowhere even close to the negative double with the largest absolute value, which is about -1.79E308, roughly 10^28 times larger.
This is a complete mess.
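To check this kind of classification yourself, you can look at the raw IEEE 754 bits of each value. The following is only a minimal sketch: the class name DoubleInspector and the method Classify are made up for illustration, and the labels it returns are arbitrary strings.

```csharp
using System;

static class DoubleInspector
{
    // Hypothetical helper: labels a double by inspecting its IEEE 754 bit fields.
    public static string Classify(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        long exponent = (bits >> 52) & 0x7FF;       // 11-bit biased exponent
        long mantissa = bits & 0xFFFFFFFFFFFFFL;    // 52-bit fraction field

        if (exponent == 0x7FF) return mantissa == 0 ? "infinity" : "NaN";
        if (exponent == 0) return mantissa == 0 ? "zero" : "subnormal";
        return "normal";
    }
}
```

Calling DoubleInspector.Classify on 2.2250738585072014E-308 gives "normal", while 1.1125369292536007E-308 and the E-313 and E-314 values come back as "subnormal".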
My advice is stop this madness immediately. It makes absolutely no sense to use values that are incredibly close to zero to represent "blank" values; values that are incredibly close to zero should be rounded to zero, not treated as blanks!
Double already has a representative for "blank" values, namely Double.NaN -- Not A Number; it is bizarre to use a valid value to represent an invalid value when the domain already includes a specific "invalid" value. (Remember that there are actually a large number of distinct NaN bit patterns; use IsNaN to determine if a double is a NaN.)
So my advice is:
Examine individually every number in the database that is a subnormal or very small normal double. Some of those probably ought to be zero and ended up as tiny values due to rounding errors. Replace them with zero. The ones that ought to be blank, replace with database null (best practice) or double NaN (acceptable, but not as good as database null.)
Write a program to find every number in the database that is impossibly large in absolute value and replace it with database null or double NaN.
Update all clients so that they understand the convention you're using to represent blank values.
You seem to want to check if a double is really small and positive or really big, finite, and negative. (Others have detailed some problems with your approach in the comments; I'm not going to go into that here.) A test like this:
if (d == d && (d > 0 && d < 1e-290 || d < -1e270 && d + d != d))
might do roughly what you want. You'll probably need to tweak the numbers above. The d == d test is checking for NaN, while the d + d != d test is checking for infinities.
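The same test can be wrapped in a small helper so the conversion loop stays readable. This is a sketch under assumptions: the names LegacyBlank, IsLegacyBlank and Normalize are invented here, the two cutoffs are simply the ones from the test above and will likely need tuning, and mapping blanks to null (rather than to Double.MinValue) is one possible choice.

```csharp
using System;

static class LegacyBlank
{
    // Assumed cutoffs, copied from the test above; tune them against the real data.
    const double TinyPositiveLimit = 1e-290;
    const double HugeNegativeLimit = -1e270;

    // True for finite values that are tiny and positive, or huge and negative.
    public static bool IsLegacyBlank(double d)
    {
        if (double.IsNaN(d) || double.IsInfinity(d)) return false;
        return (d > 0 && d < TinyPositiveLimit) || d < HugeNegativeLimit;
    }

    // Maps a stored value to null when it looks like a historical "blank" marker.
    public static double? Normalize(double d)
    {
        return IsLegacyBlank(d) ? (double?)null : d;
    }
}
```

Note that Double.MinValue itself is finite and below the negative cutoff, so it is also treated as a blank by this sketch.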
I have the following function:
static bool isPowerOf(int num, int power)
{
    double b = 1.0 / power;
    double a = Math.Pow(num, b);
    Console.WriteLine(a);
    return a == (int)a;
}
I inserted the print function for analysis.
If I call the function:
isPowerOf(25, 2)
It returns true since 5^2 equals 25.
But if I call it with 16807, which is 7^5, like this:
isPowerOf(16807, 5)
In this case, it prints '7' but a == (int)a returns false.
Can you help? Thanks!
Try using a small epsilon for rounding errors:
return Math.Abs(a - (int)a) < 0.0001;
As harold suggested, it will be better to round in case a happens to be slightly smaller than the integer value, like 3.99999:
return Math.Abs(a - Math.Round(a)) < 0.0001;
Comparisons that fix the issue have been suggested, but what's actually the problem here is that floating point should not be involved at all. You want an exact answer to a question involving integers, not an approximation of calculations done on inherently inaccurate measurements.
So how else can this be done?
The first thing that comes to mind is a cheat:
double guess = Math.Pow(num, 1.0 / power);
return num == exponentiateBySquaring((int)guess, power) ||
num == exponentiateBySquaring((int)Math.Ceil(guess), power);
// do NOT replace exponentiateBySquaring with Math.Pow
It'll work as long as the guess is less than 1 off. But I can't guarantee that it will always work for your inputs, because that condition is not always met.
So here's the next thing that comes to mind: a binary search (the variant where you search for the upper boundary first) for the base in exponentiateBySquaring(base, power) for which the result is closest to num. If and only if the closest answer is equal to num (and they are both integers, so this comparison is clean), then num is a power-th power. Unless there is overflow (there shouldn't be), that should always work.
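A rough sketch of that integer-only search follows. It makes a few assumptions: num is non-negative and power is at least 1, the names PerfectPower and PowCapped are invented, and plain repeated multiplication with an early cap is used in place of exponentiation by squaring so that overflow cannot occur.

```csharp
using System;

static class PerfectPower
{
    // Returns b^power, or limit + 1 as soon as the running product would exceed limit.
    // Plain repeated multiplication, capped so the long never overflows.
    static long PowCapped(long b, int power, long limit)
    {
        long result = 1;
        for (int i = 0; i < power; i++)
        {
            if (b != 0 && result > limit / b) return limit + 1;
            result *= b;
        }
        return result;
    }

    // Binary search for an integer base whose power-th power equals num.
    public static bool IsPowerOf(int num, int power)
    {
        if (num < 0 || power < 1) throw new ArgumentOutOfRangeException();
        long lo = 0, hi = num;
        while (lo <= hi)
        {
            long mid = lo + (hi - lo) / 2;
            long p = PowCapped(mid, power, num);
            if (p == num) return true;
            if (p < num) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}
```

With this version, PerfectPower.IsPowerOf(16807, 5) and PerfectPower.IsPowerOf(25, 2) both return true, and no floating point is involved at any step.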
Math.Pow operates on doubles, so rounding errors come into play when taking roots. If you want to check that you've found an exact power:
perform the Math.Pow as currently, to extract the root
round the result to the nearest integer
raise this integer to the supplied power, and check you get the supplied target. Math.Pow will be exact for numbers in the range of int when raising to integer powers
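Those three steps could look something like the sketch below. The class name RootCheck is made up, and the last step uses integer multiplication instead of a second Math.Pow call, which is a substitution on my part so that the final comparison is exact. It assumes the floating-point root, once rounded, lands on the right integer whenever one exists.

```csharp
using System;

static class RootCheck
{
    public static bool IsPowerOf(int num, int power)
    {
        if (num < 0 || power < 1) return false;

        // Steps 1 and 2: extract the root in floating point, then round to the nearest integer.
        long root = (long)Math.Round(Math.Pow(num, 1.0 / power));

        // Step 3: raise the candidate back up with integer arithmetic and compare.
        long result = 1;
        for (int i = 0; i < power; i++)
        {
            result *= root;
            if (result > num) return false; // overshot; stop before the product can grow further
        }
        return result == num;
    }
}
```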
If you debug the code, you can see that in the first call:
isPowerOf(25, 2)
a is holding 5.0
Here 5.0 == 5 => that is why you get true
and in the second call, isPowerOf(16807, 5),
a is holding 7.0000000000000009
Since 7.0000000000000009 != 7, you get false. Console.WriteLine(a) rounds the double when formatting it, so it shows only 7.
That is why you need to compare the nearest value like in Dani's solution
Consider this
int i = 2147483647;
var n = i + 3;
i = n;
Console.WriteLine(i); // prints -2147483646 (1)
Console.WriteLine(n); // prints -2147483646 (2)
Console.WriteLine(n.GetType()); // prints System.Int32 (3)
I am confused by the following:
(1) how could int hold the value -2147483646? (int range = -2,147,483,648 to 2,147,483,647)
(2) why does this print -2147483648 but not 2147483648 (the compiler should choose a larger type when the int range is exceeded)
(3) if it is converted somewhere, why does n.GetType() give System.Int32?
Edit1: Made the correction: Now you will get What I am Getting. (sorry for that)
var n = i + 1; to
var n = i + 3;
Edit2: One more thing: if it is an overflow, why is an exception not raised?
Addition: as the overflow occurs, is it not right to set the type for var n in the statement var n = i + 3; to another type accordingly?
you are welcome to suggest a better title, as this is not making sense to.... me at least
Thanks
Update: Poster fixed his question.
1) This output is expected because you added 3 to int.MaxValue, causing an overflow. In .NET by default this is a legal operation in unchecked code, giving a wrap-around to negative values, but if you add a checked block around the code it will throw an OverflowException instead.
2) The type of a variable declared with var is determined at compile time not runtime. It's a rule that adding two Int32s gives an Int32, not a UInt32, an Int64 or something else. So even though at runtime you can see that the result is too big for an Int32, it still has to return an Int32.
3) It's not converted to another type.
1) -2147483646 is bigger than -2,147,483,648
2) 2147483648 is out of range
3) int is an alias for Int32
1)
First of all, the value in the variable is not -2147483646, it's -2147483648. Run your test again and check the result.
There is no reason that an int could not hold the value -2147483646. It's within the range -2147483648..2147483647.
2)
The compiler chooses the data type of the variable to be the type of the result of the expression. The expression returns an int value, and even if the compiler would choose a larger data type for the variable, the expression still returns an int and you get the same value as result.
It's the operation in the expression that overflows, it's not when the result is assigned to the variable that it overflows.
3)
It's not converted anywhere.
This is an overflow, your number wrapped around and went negative
This isn't the compiler's job, as a loop at runtime can cause the same thing
int is an alias for System.Int32; they are equivalent in .NET.
This is because of the bit representation.
You use Int32, but the same goes for a signed 8-bit type (sbyte in C#).
The first bit holds the sign, and the following bits hold the number.
So with 7 bits the largest value you can represent is 127, which is 0111 1111.
When you add 1 to that you get 1000 0000: the sign bit is now set, and in two's complement the computer reads that as -128 instead of 128.
Arithmetic operations in .NET don't change the actual type.
You start off with an (32bit) integer and the +3 isn't going to change that.
That's also why you get an unexpected round number when you do this:
int a = 2147483647;
double b = a / 4;
or
int a = 2147483647;
var b = a / 4;
for that matter.
EDIT:
There is no exception because .NET overflows the number.
The overflow exception will only occur at assignment operations or as Mark explains when you set the conditions to generate the exception.
If you want an exception to be thrown, write
abc = checked(i+3)
instead. That will check for overflows.
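Also, in C#, the default setting is to not throw exceptions on overflows. But you can switch that option in your project's build properties (the "Check for arithmetic overflow" setting, which corresponds to CheckForOverflowUnderflow in the project file).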
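To see the two behaviours side by side, here is a small sketch. The class and method names (OverflowDemo, AddUnchecked, AddChecked) are made up for the example, and returning null on overflow is just one way to surface the exception.

```csharp
using System;

static class OverflowDemo
{
    // Wraps around silently, regardless of the project-wide setting.
    public static int AddUnchecked(int a, int b)
    {
        return unchecked(a + b);
    }

    // Throws OverflowException on overflow; here it is caught and reported as null.
    public static int? AddChecked(int a, int b)
    {
        try
        {
            return checked(a + b);
        }
        catch (OverflowException)
        {
            return null;
        }
    }
}
```

OverflowDemo.AddUnchecked(int.MaxValue, 3) gives -2147483646, the value from the question, while OverflowDemo.AddChecked(int.MaxValue, 3) gives null because the checked addition throws.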
You could make this easier on us all by using hex notation.
Not everyone knows that the eighth Mersenne prime is 0x7FFFFFFF
Just sayin'
What I mean is: imagine we have an 8-byte variable that has a high value and a low value. I can make one pointer point to the upper 4 bytes and another point to the lower 4 bytes, and set/retrieve their values without problems. Now, is there a way to get/set values for anything smaller than a byte? If, instead of dividing it into two 4-byte "variables", I wanted to consider eight 1-byte variables, I could use a bool, but there is no smaller type defined in C#. Would it be possible to divide it into 16 parts just with pointers? Or even into 32 or 64? It wouldn't, right?
This is a pretty academic question; I know this can be achieved otherwise with bit shifting, unions (LayoutKind.Explicit structs), etc. Thanks!
No, C# does not support bit fields and a byte is the minimum amount of addressable memory. You can manually provide properties that change one or several specific bits but you have to provide packing/unpacking logic yourself:
public bool Bit5 {
    get { return (field & 32) != 0; }
    set { if (value) field |= 32; else field &= ~32; }
}
By the way, I don't know how you achieve it using LayoutKind.Explicit as the minimum FieldOffset you can specify is one byte.
As a side note, even C++ that can do this with bit fields will just hide the bitwise tricks and makes the compiler do it instead of you. There's no way you could grab something less than a byte from memory to a register, at least on x86 architecture.
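For the 8-byte variable from the question, the same packing logic can be generalized to any bit position. This is only a sketch; the struct name Bits64 and its members are invented for illustration, and it simply hides the masking that would otherwise be written by hand.

```csharp
using System;

struct Bits64
{
    private ulong bits;

    public Bits64(ulong initial) { bits = initial; }

    // The whole 64-bit value.
    public ulong Value { get { return bits; } }

    // Reads the bit at the given position (0 = least significant, 63 = most significant).
    public bool GetBit(int index)
    {
        return (bits & (1UL << index)) != 0;
    }

    // Sets or clears the bit at the given position.
    public void SetBit(int index, bool on)
    {
        ulong mask = 1UL << index;
        if (on) bits |= mask; else bits &= ~mask;
    }
}
```

So the 64 one-bit "variables" exist only as an interface; every access still reads and writes the full 8 bytes underneath.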
I have a large set of numbers, probably in the multiple gigabytes range. First issue is that I can't store all of these in memory. Second is that any attempt at addition of these will result in an overflow. I was thinking of using more of a rolling average, but it needs to be accurate. Any ideas?
These are all floating point numbers.
This is not read from a database; it is a CSV file collected from multiple sources. It has to be accurate, as it is stored as parts of a second (e.g., 0.293482888929), and a rolling average can be the difference between .2 and .3
It is a set of numbers representing how long users took to respond to certain form actions. For example, when showing a message box, how long did it take them to press OK or Cancel? The data was sent to me stored as seconds.portions of a second; 1.2347 seconds, for example. When I convert it to milliseconds, I overflow int, long, etc. rather quickly. Even if I don't convert it, I still overflow rather quickly. I guess the one answer below is correct: maybe I don't have to be 100% accurate, and I can just look within a certain range inside a specific StdDev and be close enough.
You can sample randomly from your set ("population") to get an average ("mean"). The accuracy will be determined by how much your samples vary (as determined by "standard deviation" or variance).
The advantage is that you have billions of observations, and you only have to sample a fraction of them to get a decent accuracy or the "confidence range" of your choice. If the conditions are right, this cuts down the amount of work you will be doing.
Here's a numerical library for C# that includes a random sequence generator. Just make a random sequence of numbers that reference indices in your array of elements (from 1 to x, the number of elements in your array). Dereference to get the values, and then calculate your mean and standard deviation.
If you want to test the distribution of your data, consider using the Chi-Squared Fit test or the K-S test, which you'll find in many spreadsheet and statistical packages (e.g., R). That will help confirm whether this approach is usable or not.
Integers or floats?
If they're integers, you need to accumulate a frequency distribution by reading the numbers and recording how many of each value you see. That can be averaged easily.
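For the integer case, a frequency table might be sketched like this. The class name FrequencyMean is made up, and using System.Numerics.BigInteger for the weighted sum is my own choice to rule out overflow; the final conversion to double is the only place where rounding can occur.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

static class FrequencyMean
{
    public static double Mean(IEnumerable<int> values)
    {
        // Record how many times each distinct value occurs.
        var counts = new Dictionary<int, long>();
        foreach (int v in values)
        {
            long seen;
            counts.TryGetValue(v, out seen);
            counts[v] = seen + 1;
        }

        // Weighted sum of value * count, kept exact in a BigInteger.
        BigInteger weightedSum = BigInteger.Zero;
        long total = 0;
        foreach (var pair in counts)
        {
            weightedSum += (BigInteger)pair.Key * pair.Value;
            total += pair.Value;
        }

        if (total == 0) throw new InvalidOperationException("Sequence contains no elements.");
        return (double)weightedSum / total;
    }
}
```

The memory used depends on the number of distinct values rather than on the number of inputs, which is what makes this workable for a stream that does not fit in memory.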
For floating point, this is a bit of a problem. Given the overall range of the floats, and the actual distribution, you have to work out a bin-size that preserves the accuracy you want without preserving all of the numbers.
Edit
First, you need to sample your data to get a mean and a standard deviation. A few thousand points should be good enough.
Then, you need to determine a respectable range. Folks pick things like ±6σ (standard deviations) around the mean. You'll divide this range into as many buckets as you can stand.
In effect, the number of buckets determines the number of significant digits in your average. So, pick 10,000 or 100,000 buckets to get 4 or 5 digits of precision. Since it's a measurement, odds are good that your measurements only have two or three digits.
Edit
What you'll discover is that the mean of your initial sample is very close to the mean of any other sample. And any sample mean is close to the population mean. You'll note that most (but not all) of your means are within 1 standard deviation of each other.
You should find that your measurement errors and inaccuracies are larger than your standard deviation.
This means that a sample mean is as useful as a population mean.
Wouldn't a rolling average be as accurate as anything else (discounting rounding errors, I mean)? It might be kind of slow because of all the dividing.
You could group batches of numbers and average them recursively. Like average 100 numbers 100 times, then average the result. This would be less thrashing and mostly addition.
In fact, if you added 256 or 512 at once you might be able to bit-shift the result by either 8 or 9, (I believe you could do this in a double by simply changing the floating point mantissa)--this would make your program extremely quick and it could be written recursively in just a few lines of code (not counting the unsafe operation of the mantissa shift).
Perhaps dividing by 256 would already use this optimization? I may have to speed test dividing by 255 vs 256 and see if there is some massive improvement. I'm guessing not.
You mean of 32-bit and 64-bit numbers. But why not just use a proper Rational Big Num library? If you have so much data and you want an exact mean, then just code it.
// Bignum and RationalBignum are placeholders for an arbitrary-precision library;
// their constructors and arithmetic operators are assumed, not shown.
class RationalBignum {
    public Bignum Numerator { get; set; }
    public Bignum Denominator { get; set; }
}
class BigMeanr {
    public static int Main(string[] argv) {
        var sum = new RationalBignum(0);
        var n = new Bignum(0);
        using (var s = new FileStream(argv[0], FileMode.Open))
        using (var r = new BinaryReader(s)) {
            try {
                while (true) {
                    var flt = r.ReadSingle();
                    var rat = new RationalBignum(flt);
                    sum += rat;
                    n++;
                }
            }
            catch (EndOfStreamException) {
                // End of input: fall through and report the mean.
            }
        }
        Console.WriteLine("The mean is: {0}", sum / n);
        return 0;
    }
}
Just remember, there are more numeric types out there than the ones your compiler offers you.
You could break the data into sets of, say, 1000 numbers, average these, and then average the averages.
This is a classic divide-and-conquer type problem.
The key observation is that the average of a large set of numbers is the same
as the average of the first half of the set averaged with the average of the second half, with each half weighted by its element count.
In other words (for halves of equal size):
AVG(A[1..N]) == AVG( AVG(A[1..N/2]), AVG(A[N/2+1..N]) )
Here is a simple, recursive C# solution.
It's passed my tests, and should be completely correct.
public struct SubAverage
{
    public float Average;
    public int Count;
};
static SubAverage AverageMegaList(List<float> aList)
{
    if (aList.Count <= 500) // Brute-force average 500 numbers or less.
    {
        SubAverage avg;
        avg.Average = 0;
        avg.Count = aList.Count;
        foreach(float f in aList)
        {
            avg.Average += f;
        }
        avg.Average /= avg.Count;
        return avg;
    }
    // For more than 500 numbers, break the list into two sub-lists.
    SubAverage subAvg_A = AverageMegaList(aList.GetRange(0, aList.Count/2));
    SubAverage subAvg_B = AverageMegaList(aList.GetRange(aList.Count/2, aList.Count-aList.Count/2));
    SubAverage finalAnswer;
    finalAnswer.Average = subAvg_A.Average * subAvg_A.Count/aList.Count +
                          subAvg_B.Average * subAvg_B.Count/aList.Count;
    finalAnswer.Count = aList.Count;
    Console.WriteLine("The average of {0} numbers is {1}",
                      finalAnswer.Count, finalAnswer.Average);
    return finalAnswer;
}
The trick is that you're worried about an overflow. In that case, it all comes down to order of execution. The basic formula is like this:
Given:
A = current avg
C = count of items
V = next value in the sequence
The next average (A1) is:
A1 = ((C * A) + V) / (C + 1)
The danger is that over the course of evaluating the sequence, A should stay relatively manageable while C will become very large.
Eventually C * A will overflow the integer or double types.
One thing we can try is to re-write it like this, to reduce the chance of an overflow:
A1 = C/(C+1) * A/(C+1) + V/(C+1)
In this way, we never multiply C * A and only deal with smaller numbers. But the concern now is the result of the division operations. If C is very large, C/(C+1) (for example) may not be meaningful when constrained to normal floating point representations. The best I can suggest is to use the largest type possible for C here.
Here's one way to do it in pseudocode:
average=first
count=1
while more:
count+=1
diff=next-average
average+=diff/count
return average
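In C#, that pseudocode might be sketched as follows. The class name RunningMean is made up, and the input is taken as any IEnumerable<double> so that values can be streamed one at a time (for example, parsed line by line from a file) instead of being held in memory.

```csharp
using System;
using System.Collections.Generic;

static class RunningMean
{
    public static double Mean(IEnumerable<double> values)
    {
        double average = 0.0;
        long count = 0;
        foreach (double next in values)
        {
            count++;
            // Nudge the average toward the new value by 1/count of the difference.
            average += (next - average) / count;
        }
        if (count == 0) throw new InvalidOperationException("Sequence contains no elements.");
        return average;
    }
}
```

The running value never grows beyond the range of the inputs, so there is no large accumulated total to overflow.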
Sorry for the late comment, but isn't the formula above, provided by Joel Coehoorn, rewritten incorrectly?
I mean, the basic formula is right:
Given:
A = current avg
C = count of items
V = next value in the sequence
The next average (A1) is:
A1 = ( (C * A) + V ) / ( C + 1 )
But instead of:
A1 = C/(C+1) * A/(C+1) + V/(C+1)
shouldn't we have:
A1 = C/(C+1) * A + V/(C+1)
That would explain kastermester's post:
"My math ticks off here - You have C, which you say "go towards infinity" or at least, a really big number, then: C/(C+1) goes towards 1. A /(C+1) goes towards 0. V/(C+1) goes towards 0. All in all: A1 = 1 * 0 + 0 So put shortly A1 goes towards 0 - seems a bit off. – kastermester"
Because we would have A1 = 1 * A + 0, i.e., A1 goes towards A, which is right.
I've been using this method for calculating averages for a long time, and the aforementioned precision problems have never been an issue for me.
With floating point numbers the problem is not overflow, but loss of precision when the accumulated value gets large. Adding a small number to a huge accumulated value will result in losing most of the bits of the small number.
There is a clever solution by the author of the IEEE floating point standard himself, the Kahan summation algorithm, which deals with exactly this kind of problem by checking the error at each step and keeping a running compensation term that prevents losing the small values.
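A minimal sketch of Kahan summation in C# is shown here; the class and method names are my own, and dividing the result by the element count would give the mean.

```csharp
using System;
using System.Collections.Generic;

static class Kahan
{
    public static double Sum(IEnumerable<double> values)
    {
        double sum = 0.0;
        double compensation = 0.0; // running estimate of the low-order bits lost so far

        foreach (double x in values)
        {
            double y = x - compensation;   // apply the correction to the incoming value
            double t = sum + y;            // low-order bits of y may be lost here
            compensation = (t - sum) - y;  // recover what was lost (with opposite sign)
            sum = t;
        }
        return sum;
    }
}
```

As an illustration, summing 1e16, 1.0 and 1.0 in that order with this routine gives 10000000000000002, whereas a plain left-to-right sum in IEEE 754 double arithmetic stays at 1e16 because each 1.0 is rounded away.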
If the numbers are int's, accumulate the total in a long. If the numbers are long's ... what language are you using? In Java you could accumulate the total in a BigInteger, which is an integer which will grow as large as it needs to be. You could always write your own class to reproduce this functionality. The gist of it is just to make an array of integers to hold each "big number". When you add two numbers, loop through starting with the low-order value. If the result of the addition sets the high order bit, clear this bit and carry the one to the next column.
Another option would be to find the average of, say, 1000 numbers at a time. Hold these intermediate results, then when you're done average them all together.
Why is a sum of floating point numbers overflowing? In order for that to happen, you would need to have values near the max float value, which sounds odd.
If you were dealing with integers I'd suggest using a BigInteger, or breaking the set into multiple subsets, recursively averaging the subsets, then averaging the averages.
If you're dealing with floats, it gets a bit weird. A rolling average could become very inaccurate. I suggest using a rolling average which is only updated when you hit an overflow exception or the end of the set. So effectively dividing the set into non-overflowing sets.
Two ideas from me:
If the numbers are ints, use an arbitrary precision library like IntX - this could be too slow, though
If the numbers are floats and you know the total amount, you can divide each entry by that number and add up the result. If you use double, the precision should be sufficient.
Why not just scale the numbers (down) before computing the average?
If I were to find the mean of billions of doubles as accurately as possible, I would take the following approach (NOT TESTED):
Find out 'M', an upper bound for log2(nb_of_input_data). If there are billions of data, 50 may be a good candidate (> 1 000 000 billions capacity). Create an L1 array of M double elements. If you're not sure about M, creating an extensible list will solve the issue, but it is slower.
Also create an associated L2 boolean array (all cells set to false by default).
For each incoming data D:
int i = 0;
double localMean = D;
while (L2[i]) {
    L2[i] = false;
    localMean = (localMean + L1[i]) / 2;
    i++;
}
L1[i] = localMean;
L2[i] = true;
And your final mean will be:
double sum = 0;
double totalWeight = 0;
for (int i = 0; i < 50; i++) {
    if (L2[i]) {
        long weight = 1L << i; // 1L keeps the shift in 64 bits for i >= 31
        sum += L1[i] * weight;
        totalWeight += weight;
    }
}
return sum / totalWeight;
Notes:
Many proposed solutions in this thread miss the point of lost precision.
Using binary instead of 100-group-or-whatever provides better precision, and doubles can be safely doubled or halved without losing precision!
Try this
Iterate through the numbers incrementing a counter, and adding each number to a total, until adding the next number would result in an overflow, or you run out of numbers.
( It makes no difference if the inputs are integers or floats - use the largest precision float you can and convert each input to that type)
Divide the total by the counter to get a mean ( a floating point), and add it to a temp array
If you had run out of numbers, and there is only one element in temp, that's your result.
Start over using the temp array as input, ie iteratively recurse until you reached the end condition described earlier.
depending on the range of numbers it might be a good idea to have an array where the subscript is your number and the value is the quantity of that number, you could then do your calculation from this