This question is about the threshold at which Math.Floor(double) and Math.Ceiling(double) decide to give you the previous or next integer value. I was disturbed to find that the threshold seems to have nothing to do with Double.Epsilon, which is the smallest positive value that can be represented by a double. For example:
double x = 3.0;
Console.WriteLine( Math.Floor( x - Double.Epsilon ) ); // expected 2, got 3
Console.WriteLine( Math.Ceiling( x + Double.Epsilon) ); // expected 4, got 3
Even multiplying Double.Epsilon by a fair bit didn't do the trick:
Console.WriteLine( Math.Floor( x - Double.Epsilon*1000 ) ); // expected 2, got 3
Console.WriteLine( Math.Ceiling( x + Double.Epsilon*1000) ); // expected 4, got 3
With some experimentation, I was able to determine that the threshold is somewhere around 2.2E-16, which is very small, but VASTLY bigger than Double.Epsilon.
The reason this question came up is that I was trying to calculate the number of digits in a number with the formula var digits = Math.Floor( Math.Log( n, 10 ) ) + 1. This formula doesn't work for n=1000 (which I stumbled on completely by accident) because Math.Log( 1000, 10 ) returns a number that's 4.44E-16 off its actual value. (I later found that the built-in Math.Log10(double) provides much more accurate results.)
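The pitfall is easy to reproduce outside C# too; here is a sketch in Python, whose float is the same IEEE 754 double. Python's math.log(n, 10) is likewise computed as a quotient of two logarithms, while math.log10 uses a dedicated, more accurate routine:

```python
import math

# math.log(1000, 10) is computed as log(1000)/log(10), which lands
# just below 3 due to rounding in the division.
via_quotient = math.log(1000, 10)   # 2.9999999999999996
via_log10 = math.log10(1000)        # exactly 3.0

# The digit-count formula silently breaks with the quotient form:
digits_bad = math.floor(via_quotient) + 1    # 3, but 1000 has 4 digits
digits_good = math.floor(via_log10) + 1      # 4
```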
Shouldn't the threshold be tied to Double.Epsilon or, if not, shouldn't the threshold be documented? (I couldn't find any mention of this in the official MSDN documentation.)
Shouldn't the threshold be tied to Double.Epsilon
No.
The representable doubles are not uniformly distributed over the real numbers. Close to zero there are many representable values. But the further from zero you get, the further apart representable doubles are. For very large numbers even adding 1 to a double will not give you a new value.
Therefore the threshold you are looking for depends on how large your number is. It is not a constant.
The value of Double.Epsilon is 4.94065645841247e-324. Adding this value to 3, or subtracting it from 3, simply yields 3 again, due to the way floating-point works.
A double has 53 bits of mantissa, so the smallest value you can add that will have any impact will be approximately 2^53 times smaller than your variable. So something around 1e-16 sounds about right (order of magnitude).
So to answer your question: there is no "threshold"; floor and ceil simply act on their argument in exactly the way you would expect.
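The size-dependent gap is easy to observe directly; here is a small sketch in Python (whose float is an IEEE 754 double, like C#'s double), using math.ulp to measure the spacing:

```python
import math

x = 3.0
eps = 5e-324                      # Double.Epsilon, the smallest subnormal

# Adding or subtracting the smallest representable value changes nothing:
assert x - eps == x
assert x + eps == x

# The actual gap between 3.0 and its neighbours is 2^-51, about 4.44e-16:
gap = math.ulp(x)
assert math.floor(x - gap) == 2   # one step down, and floor really gives 2

# The gap grows with magnitude: near 3e300 it is astronomically large
assert math.ulp(3e300) > 1e284
```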
This is going to be hand-waving rather than references to specifications, but I hope my "intuitive explanation" suits you well.
Epsilon represents the smallest magnitude that can be represented that is different from zero. Considering the mantissa and exponent of a double, that's going to be extremely tiny -- think 10^-324. There are over three hundred zeros between the decimal point and the first non-zero digit.
However, a Double represents roughly 14-15 digits of precision. That still leaves around 310 digits of zeros between Epsilon and the integers.
Doubles are fixed to a certain bit length. If you really want arbitrary-precision calculations, you should use an arbitrary-precision library instead. And be prepared for it to be significantly slower: representing all 325 digits that would be necessary to store a number such as 2 + epsilon requires roughly 75 times more storage per number. That storage isn't free, and calculating with it certainly cannot go at full CPU speed.
Related
I feel really stupid about this, but I'm having some problems calculating the change in % when working with negative numbers.
The calculation I'm using gives a satisfying result when the numbers are > 0.
decimal rateOfChange = (newNumber - oldNumber) / Math.Abs(oldNumber);
Let's say I have two numbers, 0.476 (newNumber) and -0.016 (oldNumber). That's an increase of 0.492, and with my calculation the rate of change is 3075%.
If I instead have 0.476 (newNumber) and 0.001 (oldNumber), that's an increase of 0.475, and my calculation gives a rate of change of 47500%, which seems correct.
The blue line represents example one and the red line is example two. In my world the blue line should have the bigger % change.
How would I write this calculation to get the correct % change when also dealing with negative numbers? I want it to handle both increases and decreases in the % change.
I understand this is a math issue and I need to work on my math.
Seems to work for me.
decimal newNumber = 0.476m;
decimal oldNumber = -0.016m;
decimal increase = newNumber - oldNumber; // this is 0.492 which is true
decimal rateOfChange = increase / Math.Abs(oldNumber);
rateOfChange comes out as 30.75, which is 3075%, the correct change.
The second example works too. The increase is 0.475, which gives a rateOfChange of 475, equal to 47500%, which is also correct.
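For what it's worth, both examples can be checked mechanically; here is a sketch in Python, using Decimal to mirror the exact base-10 arithmetic of C#'s decimal:

```python
from decimal import Decimal

def rate_of_change(new, old):
    # Relative change, using the magnitude of the old value as reference
    # (Decimal mirrors C#'s decimal type, so these results are exact).
    return (new - old) / abs(old)

# Example one: increase of 0.492 from a small negative baseline -> 3075%
assert rate_of_change(Decimal("0.476"), Decimal("-0.016")) == Decimal("30.75")

# Example two: increase of 0.475 from an even smaller baseline -> 47500%
assert rate_of_change(Decimal("0.476"), Decimal("0.001")) == Decimal("475")
```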
You are mixing two concepts: absolute and relative deviation.
You seem to expect that bigger absolute deviations imply bigger relative deviations, which is false. You also seem to think that negative numbers are the cause of the unexpected (but correct) results you are getting.
Relative deviation depends on the magnitude of the absolute deviation and the magnitude of your reference value, not its sign. You can have smaller absolute deviations that imply really big relative deviations and vice versa.
old value:      1
new value:      100
abs. deviation: 99
rel. deviation: 99

old value:      0.00001
new value:      1
abs. deviation: 0.99999
rel. deviation: 99999
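The two examples above can be verified exactly; here is a sketch in Python, using Fraction to keep the arithmetic exact:

```python
from fractions import Fraction

def rel_dev(old, new):
    # Relative deviation: absolute deviation over the magnitude of the reference.
    return (new - old) / abs(old)

# Large absolute deviation (99), relative deviation 99 (= 9900%):
assert rel_dev(Fraction(1), Fraction(100)) == 99

# Tiny absolute deviation (0.99999), but a huge relative deviation:
assert rel_dev(Fraction(1, 100000), Fraction(1)) == 99999
```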
Here comes a silly question. I'm playing with the Parse function of System.Single, and it behaves unexpectedly, which might be because I don't really understand floating-point numbers. The MSDN page for System.Single.MaxValue states that the max value is 3.402823e38, which in standard form is
340282300000000000000000000000000000000
If I use this string as an argument to the Parse() method, it succeeds without error, and if I change any of the zeros to an arbitrary digit it still succeeds (although it seems to ignore them, judging by the result). In my understanding, that exceeds the limit, so what am I missing?
It may be easier to think about this by looking at some lower numbers. All (positive) integers up to 16777216 can be exactly represented in a float. After that point, only every other integer can be represented (up to the next time we hit a limit, at which point it's only every 4th integer that can be represented).
So what has to happen then is that 16777218 has to stand for 16777218 ± 1, 16777220 has to stand for 16777220 ± 1, etc. As you move up into even larger numbers, the range of integers that each value has to "represent" grows wider and wider, until the point where 340282300000000000000000000000000000000 represents all numbers in the range 340282300000000000000000000000000000000 ± 100000000000000000000000000000000, approximately (I've not actually worked out what the right ± value is here, but hopefully you get the point).
Number       Significand                 Exponent
16777215  =  1 11111111111111111111111 * 2^0  =  111111111111111111111111
16777216  =  1 00000000000000000000000 * 2^1  =  1000000000000000000000000
16777218  =  1 00000000000000000000001 * 2^1  =  1000000000000000000000010
             ^
             |
             Implicit leading bit
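This rounding can be observed by round-tripping values through a 32-bit float; a small Python sketch using struct (C#'s float rounds the same way):

```python
import struct

def to_float32(x):
    # Round a Python float (an IEEE 754 double) to the nearest 32-bit float.
    return struct.unpack("f", struct.pack("f", x))[0]

assert to_float32(16777215.0) == 16777215.0   # 2^24 - 1: still exact
assert to_float32(16777216.0) == 16777216.0   # 2^24: exact
assert to_float32(16777217.0) == 16777216.0   # no 25-bit odd value; rounds down
assert to_float32(16777218.0) == 16777218.0   # even integers are still exact
```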
That's actually not true: change the first 0 to 9 and you will see an exception. In fact, change it to anything 6 and up and it blows up.
Any other digit is simply rounded away; a float is not a 100% accurate representation of a decimal with 38+1 positions, and that's fine.
A floating-point number is not like a decimal. It comprises a mantissa that carries the significant digits and an exponent that effectively says how far left or right of the decimal point to place the mantissa. A System.Single can only handle about seven significant digits in the mantissa. If you replace any of your trailing zeroes with an arbitrary digit, it is lost when your decimal is converted into mantissa-and-exponent form.
Good question. This happens because being able to store a number of that magnitude doesn't mean the type has enough precision to hold it exactly. A float can only store ~6-7 leading digits, plus an exponent describing the decimal point's position.
0.012345 and 1234500 hold the same amount of information: same mantissa, different exponents. MSDN states only that the value AFTER exponentiation cannot be bigger than MaxValue.
Given a floating-point number, I would like to separate it into a sum of parts, each with a given number of bits. For example, given 3.1415926535 and told to separate it into base-10 parts of 4 digits each, it would return 3.141 + 5.926E-4 + 5.350E-8. Actually, I want to separate a double (which has 52 bits of precision) into three parts with 18 bits of precision each, but it was easier to explain with a base-10 example. I am not necessarily averse to tricks that use the internal representation of a standard double-precision IEEE float, but I would really prefer a solution that stays purely in the floating point realm so as to avoid any issues with endian-dependency or non-standard floating point representations.
No, this is not a homework problem, and, yes, this has a practical use. If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type. Starting from this kind of decomposition, then multiplying all the parts and convolving, is one way to do that. Yes, I could also use an arbitrary-precision floating-point library, but this approach is likely to be faster when only a few parts are involved, and it will definitely be lighter-weight.
If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type.
Exactly. This technique can be found in Veltkamp/Dekker multiplication. While accessing the bits of the representation, as in other answers, is one possibility, you can also do it with floating-point operations alone. There is one instance in this blog post. The part you are interested in is:
Input: f; coef is 1 + 2^N
p = f * coef;
q = f - p;
h = p + q; // h contains the 53-N highest bits of f
l = f - h; // l contains the N lowest bits of f
*, -, and + must be exactly the IEEE 754 operations at the precision of f for this to work. On Intel architectures, these operations are provided by the SSE2 instruction set. Visual C also sets the precision of the historical x87 FPU to 53 bits in the prelude of the C programs it compiles, which helps.
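For reference, the same sequence can be sketched in Python, whose float is an IEEE 754 double, so *, - and + are exactly the required operations; N = 27 is the split used in Dekker's exact multiplication:

```python
def split(f, n=27):
    # Veltkamp split: hi gets the 53-n high bits of f, lo the n low bits,
    # and hi + lo == f exactly (no rounding error in the decomposition).
    coef = 1.0 + 2.0 ** n
    p = f * coef
    q = f - p
    hi = p + q   # the 53-n highest bits of f
    lo = f - hi  # the n lowest bits of f
    return hi, lo

x = 3.141592653589793
hi, lo = split(x)
assert hi + lo == x                 # the split loses nothing
assert (hi * 2 ** 24).is_integer()  # for x in [2,4), 26 bits means hi is a
                                    # multiple of 2^-24
```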
The C way of decomposing numbers would be fabs and frexp, which remove the sign and exponent. The result necessarily lies in [0.5, 1.0). Multiplying that by 1<<N means the integer part (obtained by modf) contains the top N bits.
You can use BitConverter.DoubleToInt64Bits and C#'s bitwise operators. You seem to be familiar with IEEE floating point formats so I'll not add more detail.
I just noticed the C tag. In this case, you can use a union and do pretty much the same.
The real problems you have are:
Handling the implicit leading "1". In border cases, this will lead you into +0 / -0 situations. I can predict your code will be full of special cases because of this.
With very low exponents, you will push them out of range even before you consider the "leading 1" problem. Even if in range, you will need to resort to subnormals. Given the big gap between normal and subnormal numbers, I also dare to predict that there will be several ranges of valid floating-point numbers that will have no possible representation in this scheme.
Except as noted above, handling of the exponent should be trivial: subtract 18 and 36 for the second and third 18-bit parts (and then find the leading 1, further decreasing it, of course).
Ugly solution? IEEE 754 is ugly by itself in the border cases. Big-endian/little-endian is the least of your problems.
Personally, I think this will get too complicated for your original objective. Just stick to a simple solution to your problem: find a function that counts trailing zeroes (does the standard itself define one? I could be confusing it with a library) and ensure that the sum is > 52. Yes, your requirement of "half the digits" (you meant 26 bits, right?) is stronger than necessary. It is also wrong because it doesn't take the implicit 1 into account, which is why above I said > 52 rather than >= 52.
Hope this helps.
Numerically, in general, you can shift left n digits, convert to integer and subtract.
a = (3.1415926535)*1000 = 3141.5926535
b = (int) a = 3141
c = a - (double) b = 0.5926535 << can convert this to 0.5926, etc.
d = (double) b / 1000 = 3.141 << except this MIGHT NOT be exact in base 2!!
But the principle is the same if you do all the multiplications/divisions by powers of 2.
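The same recipe done with a power of 2 instead of 1000 stays exact at every step; a Python sketch:

```python
import math

def top_bits(x, n):
    # Shift left n bits, truncate to an integer, shift back.
    scaled = x * 2.0 ** n     # exact: multiplying by a power of 2
    head = math.floor(scaled) # keep the top bits as an integer
    part = head / 2.0 ** n    # exact: dividing by a power of 2
    return part, x - part     # the remainder is exact too (Sterbenz lemma)

x = 3.1415926535
part, rest = top_bits(x, 18)
assert part + rest == x       # no rounding anywhere
assert 0 <= rest < 2.0 ** -18
```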
The Double data type cannot exactly represent some base-10 values. This is because of how floating-point numbers represent real numbers, and it means that when representing monetary values, one should use the decimal value type to prevent errors. (Feel free to correct errors in this preamble.)
What I want to know is: which values present such a problem under the Double data type on a 64-bit architecture in the standard .NET Framework (C#, if that makes a difference)?
I expect the answer to be a formula or rule for finding such values, but I would also like some example values.
Any number which cannot be written as a finite sum of powers of 2 (with positive and negative exponents) cannot be exactly represented as a binary floating-point number.
The common IEEE formats for 32- and 64-bit representations of floating-point numbers impose further constraints; they limit the number of binary digits in both the significand and the exponent. So there are maximum and minimum representable numbers (approximately +/- 10^308 (base-10) if memory serves) and limits to the precision of a number that can be represented. This limit on the precision means that, for 64-bit numbers, the difference between the exponent of the largest power of 2 and the smallest power in a number is limited to 52, so if your number includes a term in 2^52 it can't also include a term in 2^-1.
Simple examples of numbers which cannot be exactly represented in binary floating-point numbers include 1/3, 2/3, 1/5.
Since the set of floating-point numbers (in any representation) is finite, and the set of real numbers is infinite, one algorithm to find a real number which is not exactly representable as a floating-point number is to select a real number at random. The probability that the real number is exactly representable as a floating-point number is 0.
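A few concrete checks (in Python, whose float is an IEEE 754 double):

```python
from decimal import Decimal
from fractions import Fraction

# 0.1 has no finite binary expansion, so the stored double is slightly off:
assert Decimal(0.1) != Decimal("0.1")
assert 0.1 + 0.2 != 0.3

# Fraction exposes the exact value actually stored:
assert Fraction(0.5) == Fraction(1, 2)   # 1/2 is a power of 2: exact
assert Fraction(0.1) != Fraction(1, 10)  # 1/10 is not: inexact
```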
You generally need to be prepared for the possibility that any value you store in a double has some small amount of error. Unless you're storing a constant value, chances are it could be something with at least some error. If it's imperative that there never be any error, and the values aren't constant, you probably shouldn't be using a floating point type.
What you probably should be asking in many cases is, "How do I deal with the minor floating point errors?" You'll want to know what types of operations can result in a lot of error, and what types don't. You'll want to ensure that comparing two values for "equality" actually just ensures they are "close enough" rather than exactly equal, etc.
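As a sketch of that "close enough" comparison in Python: math.isclose is the library version, and the hand-rolled variant below shows the usual relative-plus-absolute-tolerance shape:

```python
import math

a = 0.1 + 0.2
b = 0.3

assert a != b              # exact equality fails
assert math.isclose(a, b)  # default rel_tol=1e-09 says "equal enough"

# A hand-rolled version of the same idea:
def nearly_equal(x, y, rel_tol=1e-9, abs_tol=1e-12):
    return abs(x - y) <= max(rel_tol * max(abs(x), abs(y)), abs_tol)

assert nearly_equal(a, b)
```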
This question actually goes beyond any single programming language or platform. The inaccuracy is actually inherent in binary data.
Consider that with a double, each binary digit D to the left of the point (at 0-based index I) represents the value D * 2^I, and every digit to the right of the point (at 1-based index I) represents the value D * 2^(-I).
As an example, 5.625 (base 10) would be 101.101 (base 2).
Given this, any decimal value that can't be expressed as a sum of such powers 2^(-I), for the available values of I, will have a slightly incorrect value as a double.
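A quick check of the 5.625 example, summing the positional weights of 101.101 (base 2):

```python
# 101.101 in base 2: digits left of the point weigh 2^I, right of it 2^-I
value = 1 * 2**2 + 0 * 2**1 + 1 * 2**0 + 1 * 2**-1 + 0 * 2**-2 + 1 * 2**-3
assert value == 5.625

# int() confirms the integer part: 101 (base 2) == 5
assert int("101", 2) == 5
```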
A float is represented by three fields, s, e and m, combined by the following formula:
s * m * 2^e
This means that any number that cannot be represented using the given expression (and in the respective domains of s, e and m) cannot be represented exactly.
Basically, you can represent all numbers between 0 and 2^53 - 1 multiplied by a certain power of two (possibly a negative power).
As an example, all numbers between 0 and 2^53 - 1 can be represented multiplied with 2^0 = 1. And you can also represent all those numbers by dividing them by 2 (with a .5 fraction). And so on.
This answer does not fully cover the topic, but I hope it helps.
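A quick probe of that 2^53 - 1 boundary (Python floats are IEEE 754 doubles):

```python
# Every integer up to 2^53 is exactly representable...
assert 2.0 ** 53 - 1 == 2 ** 53 - 1     # exact as a float

# ...but the very next odd integer is not:
assert float(2 ** 53 + 1) == 2.0 ** 53  # rounds back down

# Scaled by a power of two, the same integers stay exact:
assert (2 ** 53 - 1) / 2 == 2 ** 52 - 0.5
```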
The following code in C# (.Net 3.5 SP1) is an infinite loop on my machine:
for (float i = 0; i < float.MaxValue; i++) ;
It reached the number 16777216.0, and 16777216.0 + 1 evaluates to 16777216.0. So at this point, i + 1 == i.
This is some craziness.
I realize there is some inaccuracy in how floating-point numbers are stored, and I've read that whole numbers greater than 2^24 cannot be properly stored as a float.
Still, the code above should be valid in C# even if the number cannot be properly represented.
Why does it not work?
You can get the same thing to happen for double, but it takes a very long time. 9007199254740992.0 (which is 2^53) is the limit for double.
Right, so the issue is that in order to add one to the float, it would have to become
16777217.0
It just so happens that this value cannot be represented exactly as a float: at this magnitude the spacing between adjacent floats has grown to 2, so 16777217.0 falls in a gap. (The next representable value up is 16777218.0.)
So, it rounds to the nearest representable float
16777216.0
Let me put it this way:
Since the precision "floats" with the magnitude of the number, at larger values you can only move up in larger and larger increments.
EDIT:
Ok, this is a little bit difficult to explain, but try this:
float f = float.MaxValue;
f -= 1.0f;
Debug.Assert(f == float.MaxValue);
This will run just fine, because at that value, representing a difference of 1.0f would require about 128 bits of precision in the significand. A float has only 32 bits in total (and only 24 of them for the significand).
EDIT2
By my calculations, at least 128 binary digits (unsigned) would be necessary:
log(3.40282347E+38) / log(2) ≈ 128
As a solution to your problem, you could loop using 128-bit integers. However, this will take at least a decade to complete (in fact, vastly longer).
Imagine for example that a floating point number is represented by up to 2 significant decimal digits, plus an exponent: in that case, you could count from 0 to 99 exactly. The next would be 100, but because you can only have 2 significant digits that would be stored as "1.0 times 10 to the power of 2". Adding one to that would be ... what?
At best, it would be 101 as an intermediate result, which would actually be stored (via a rounding error which discards the insignificant 3rd digit) as "1.0 times 10 to the power of 2" again.
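This thought experiment can be played out with Python's decimal module by shrinking the context precision to two significant digits; a sketch:

```python
from decimal import Decimal, getcontext

getcontext().prec = 2  # pretend our numbers have only 2 significant digits

x = Decimal(99)
x += 1                          # 100 -> stored as 1.0E+2
assert x == Decimal("1.0E+2")

x += 1                          # 101 rounds straight back to 1.0E+2
assert x == Decimal("1.0E+2")   # the counter is stuck, just like the float
```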
To understand what's going wrong, you're going to have to read the IEEE 754 standard on floating point.
Let's examine the structure of a floating point number for a second:
A floating point number is broken into two parts (ok 3, but ignore the sign bit for a second).
You have a sign bit, an exponent and a mantissa, laid out like so:
s eeeeeeee mmmmmmmmmmmmmmm
Note: that is not accurate to the number of bits, but it gives you a general idea of what's happening.
To figure out what number you have we do the following calculation:
mmmmmm * 2^(eeeeee) * (-1)^s
So what is float.MaxValue going to be? Well you're going to have the largest possible mantissa and the largest possible exponent. Let's pretend this looks something like:
01111111111111111
In actuality we define NaN, +/-Inf and a couple of other conventions, but ignore them for a second because they're not relevant to your question.
So, what happens when you have 9.9999 * 2^99 + 1? Well, you do not have enough significant figures to add 1, so the result gets rounded back to the same number. In the case of single-precision floating point, the point at which +1 starts to get rounded away happens to be 16777216.0.
It has nothing to do with overflow, or being near the max value. The float 16777216.0 stores the binary value 16777216 exactly. You then increment it by 1, so it should be 16777217.0, except that the nearest float to 16777217.0 is 16777216! So it doesn't actually get incremented, or at least the increment doesn't do what you expect.
Here is a class written by Jon Skeet that illustrates this:
DoubleConverter.cs
Try this code with it:
double d1 = 16777217.0;
Console.WriteLine(DoubleConverter.ToExactString(d1));
float f1 = 16777216.0f;
Console.WriteLine(DoubleConverter.ToExactString(f1));
float f2 = 16777217.0f;
Console.WriteLine(DoubleConverter.ToExactString(f2));
Notice how the internal representation of 16777216.0f is the same as that of 16777217.0f!
When i reaches 16777216.0, the next iteration computes i + 1, but a float cannot hold 16777217.0, so the result rounds back to 16777216.0. i never advances past this value, let alone approaches float.MaxValue, and thus the loop never ends.