Given two float values (fLow and fHigh), how could you calculate the greatest or maximum stride/gap between two successive representable values?
For example:
In the range 16777217f to 20000000f the answer would be 2, as values in that range are effectively rounded to the nearest multiple of two.
Generalizing this to an arbitrary range has got me scratching my head - any suggestions?
cheers,
This should be language neutral, but I'm using C# (which conforms to IEEE-754 for this, I think).
This is in C. It requires some IEEE 754 behavior, for rounding and such. For IEEE 754 64-bit binary (double), SmallestPositive is 2^-1074, approximately 4.9406564584124654417656879286822137236505980261e-324, and DBL_EPSILON is 2^-52, 2.220446049250313080847263336181640625e-16. For 32-bit binary (float), change DBL to FLT and double to float wherever they appear (and fabs to fabsf and fmax to fmaxf, although it should work without these changes). Then SmallestPositive is 2^-149, approximately 1.401298464324817070923729583289916131280261941876515771757068283889791e-45, and FLT_EPSILON is 2^-23, 1.1920928955078125e-07.
For an interval between two values, the greatest step size is of course the step size at the endpoint with larger magnitude. (If that endpoint is exactly a power of two, the step size from that point to the next does not appear in the interval itself, so that would be a special case.)
#include <float.h>
#include <math.h>
/* Return the ULP of q.
This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
05.12_, Faculty for Information and Communication Sciences, Hamburg
University of Technology, November 13, 2005.
*/
double ULP(double q)
{
    // SmallestPositive is the smallest positive floating-point number.
    static const double SmallestPositive = DBL_EPSILON * DBL_MIN;

    /* Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
       something in [.75 ULP, 1.5 ULP) (even with rounding).
    */
    static const double Scale = 0.75 * DBL_EPSILON;

    q = fabs(q);
    return fmax(SmallestPositive, q - (q - q * Scale));
}
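For the C# side of the original question, a float translation of the same trick might look like the sketch below. This is my illustration, not part of the answer above; UlpF is a made-up helper name, and it assumes C# floats behave as IEEE 754 single precision (they do on current runtimes).

using System;

class StrideDemo
{
    // Same .75-ULP trick as the C code above, with float constants swapped in.
    static float UlpF(float q)
    {
        const float SmallestPositive = float.Epsilon;  // 2^-149
        const float Scale = 0.75f * 1.1920929e-7f;     // 0.75 * FLT_EPSILON (2^-23)

        q = Math.Abs(q);
        return Math.Max(SmallestPositive, q - (q - q * Scale));
    }

    static void Main()
    {
        // The greatest step in [fLow, fHigh] is the step at the endpoint
        // with the larger magnitude.
        float fLow = 16777217f, fHigh = 20000000f;
        Console.WriteLine(UlpF(Math.Max(Math.Abs(fLow), Math.Abs(fHigh))));  // 2
    }
}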
Well, machine accuracy is, as the name indicates, really something that might in general depend on the machine and even on the compiler. So, to be really sure you will typically have to write a program that actually tests what is going on.
However, I suspect that you are really looking for some handy formulas that you can use to approximate the maximum distance in a given interval. The Wikipedia article on machine epsilon gives a really nice overview over this topic and I'm mostly quoting from this source in the following.
Let s be the machine epsilon of your floating point representation (i.e., about 2^(-24) in the case of standard floats); then the maximum spacing between a normalised number x and its neighbors is 2*s*|x|. The word normalised is really crucial here, and I will not even try to consider the situation for denormalised numbers because this is where things get really nasty...
That is, in your particular case the maximum spacing h in the interval you propose is given by h = 2*s*max(|fLow|, |fHigh|).
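In C# that formula reads almost verbatim. A quick sketch (mine, not the answer's), using s = 2^-24 for single precision; note that the result is an upper bound on the actual spacing:

float fLow = 16777217f, fHigh = 20000000f;  // the interval from the question
float s = 5.9604645e-8f;                    // machine epsilon s = 2^-24 for float
float h = 2 * s * Math.Max(Math.Abs(fLow), Math.Abs(fHigh));
Console.WriteLine(h);                       // ~2.38, an upper bound on the true spacing of 2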
Related
Given a floating-point number, I would like to separate it into a sum of parts, each with a given number of bits. For example, given 3.1415926535 and told to separate it into base-10 parts of 4 digits each, it would return 3.141 + 5.926E-4 + 5.350E-8. Actually, I want to separate a double (which has 52 bits of precision) into three parts with 18 bits of precision each, but it was easier to explain with a base-10 example. I am not necessarily averse to tricks that use the internal representation of a standard double-precision IEEE float, but I would really prefer a solution that stays purely in the floating point realm so as to avoid any issues with endian-dependency or non-standard floating point representations.
No, this is not a homework problem, and, yes, this has a practical use. If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type. Starting from this kind of decomposition, then multiplying all the parts and convolving, is one way to do that. Yes, I could also use an arbitrary-precision floating-point library, but this approach is likely to be faster when only a few parts are involved, and it will definitely be lighter-weight.
If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type.
Exactly. This technique can be found in Veltkamp/Dekker multiplication. While accessing the bits of the representation as in other answers is a possibility, you can also do it with only floating-point operations. There is one instance in this blog post. The part you are interested in is:
Input: f; coef is 1 + 2^N
p = f * coef;
q = f - p;
h = p + q; // h contains the 53-N highest bits of f
l = f - h; // l contains the N lowest bits of f
*, -, and + must be exactly the IEEE 754 operations at the precision of f for this to work. On Intel architectures, these operations are provided by the SSE2 instruction set. Visual C sets the precision of the historical FPU to 53 bits in the prelude of the C programs it compiles, which also helps.
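The same sequence carries over to C# doubles, since C# doubles are IEEE 754 64-bit values. Here is a sketch of mine (not the blog's code), assuming the runtime evaluates double arithmetic at exactly 64-bit precision, which holds on SSE2-based runtimes as noted above:

// Veltkamp splitting: High gets the 53-N highest bits of f, Low the N lowest.
static (double High, double Low) Split(double f, int n)
{
    double coef = 1.0 + (1L << n);  // 1 + 2^N, exact for n <= 52
    double p = f * coef;
    double q = f - p;
    double h = p + q;
    double l = f - h;
    return (h, l);
}

For the three-way 18-bit decomposition in the question, one could call Split(f, 35) to peel off the top 18 bits and then Split the low part again with n = 17; that is untested sketching on my part, not a recipe from the blog post.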
The C way of decomposing numbers would be fabs and frexp, which remove the sign and exponent. The result necessarily lies in [0.5, 1.0). Multiplying that by 1<<N means the integer part (obtained by modf) contains the top N bits.
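A rough C# analogue of that route, for anyone who wants to stay in the OP's language: this is my sketch, not part of the answer (Math.ILogB and Math.ScaleB require .NET Core 3.0 or later, and the sign is dropped for simplicity, as fabs would do):

// Mimics frexp/modf: peel off the top n significand bits of x.
static (double Top, double Rest) TakeTop(double x, int n)
{
    x = Math.Abs(x);                      // like fabs
    int e = Math.ILogB(x) + 1;            // x = m * 2^e with m in [0.5, 1)
    double m = Math.ScaleB(x, -e);        // like frexp's significand
    double shifted = m * (1L << n);       // top n bits move into the integer part
    double top = Math.Truncate(shifted);  // like modf's integer part
    return (Math.ScaleB(top, e - n), Math.ScaleB(shifted - top, e - n));
}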
You can use BitConverter.DoubleToInt64Bits and C#'s bitwise operators. You seem to be familiar with IEEE floating point formats so I'll not add more detail.
I just noticed the C tag. In that case, you can use a union and do pretty much the same.
The real problems you have are:
Handling the implicit leading "1". In border cases, this would lead you to +0 / -0 situations. I can predict your code will be full of special cases because of this reason.
With very low exponents, you will get them out of range even before you consider the "leading 1" problem. Even if in-range, you will need to resort to subnormals. Given the big gap between normal and subnormal numbers, I also dare to predict that there will be several ranges of valid floating point numbers that will have no possible representation in this scheme.
Except as noted above, handling of the exponent should be trivial: subtract 18 and 36 for the second and third 18-bit parts (and then find the leading 1, further decreasing it, of course).
Ugly solution? IEEE 754 is ugly by itself in the border cases. Big-endian/little-endian is the least of your problems.
Personally, I think this will get too complicated for your original objective. Just stick to a simple solution to your problem: find a function that counts trailing zeroes (does the standard itself define one? I could be confusing it with a library) and ensure that the sum is > 52. Yes, your requirement of "half the digits" (you meant 26 bits, right?) is stronger than necessary. And it is also wrong, because it doesn't take into account the implicit 1. This is also why above I didn't say >= 52, but > 52.
Hope this helps.
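A sketch of that trailing-zero check (my code, reusing the BitConverter route from the earlier answer; the helper name is made up):

// Count how many of the 52 stored mantissa bits are trailing zeros.
// A product of two doubles is exact when their significant bits together
// fit in 53, i.e. when their trailing-zero counts sum to more than 52.
static int TrailingMantissaZeros(double v)
{
    long mantissa = BitConverter.DoubleToInt64Bits(v) & 0xFFFFFFFFFFFFFL;
    if (mantissa == 0) return 52;  // a power of two (or zero)
    int count = 0;
    while ((mantissa & 1) == 0) { mantissa >>= 1; count++; }
    return count;
}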
Numerically, in general, you can shift left n digits, convert to integer and subtract.
a = (3.1415926535)*1000 = 3141.5926535
b = (int) a = 3141
c = a - (double) b = 0.5926535 << can convert this to 0.5926, etc.
d = (double) b / 1000 = 3.141 << except this MIGHT NOT be exact in base 2!!
But the principle is the same if you do all the mults/divides by powers of 2.
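For instance, here is the power-of-two version in C# (a quick sketch; every step is exact because scaling by 2^k only changes the exponent, barring overflow or underflow):

double a = 3.1415926535 * (1L << 18);  // "shift left" 18 bits: exact
double b = Math.Truncate(a);           // integer part holds the top bits
double c = a - b;                      // remainder, computed exactly
double d = b / (1L << 18);             // exact, unlike dividing by 1000
Console.WriteLine(d + c / (1L << 18) == 3.1415926535);  // True: nothing lost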
While testing why my program was not working as intended, I tried typing the calculations that seemed to be failing into the immediate window.
Math.Floor(1.0f)
1.0 - correct
However:
200f * 0.005f
1.0
Math.Floor(200f * 0.005f)
0.0 - incorrect
Furthermore:
(float)(200f * 0.005f)
1.0
Math.Floor((float)(200f * 0.005f))
0.0 - incorrect
Probably some float precision loss is occurring; 0.99963 ≠ 1.00127, for example.
I wouldn't mind storing less precise values, but in a non-lossy way: for example, a numeric type that stored values the way integers do, but to only three decimal places, if it could be made performant.
I think there is probably a better way of calculating (n * 0.005f) with regard to such errors.
edit:
TY, a solution:
Math.Floor(200m * 0.005m)
Also, as I understand it, this would work if I didn't mind changing the 1/200 into 1/256:
Math.Floor(200f * 0.00390625f)
The solution I'm using. It's the closest I can get in my program and seems to work ok:
float x = ...;
UInt16 n = 200;
decimal d = 1m / n;
... = Math.Floor((decimal)x * d)
Floats represent numbers as fractions with powers of two in the denominator. That is, you can exactly represent 1/2, or 3/4, or 19/256. Since .005 is 1/200, and 200 is not a power of two, instead what you get for 0.005f is the closest fraction that has a power of two on the bottom that can fit into a 32 bit float.
Decimals represent numbers as fractions with powers of ten in the denominator. Like floats, they introduce errors when you try to represent numbers that do not fit that pattern. 1m/333m for example, will give you the closest number to 1/333 that has a power of ten as the denominator and 29 or fewer significant digits. Since 0.005 is 5/1000, and that is a power of ten, 0.005m will give you an exact representation. The price you pay is that decimals are much larger and slower than floats.
You should always always always use decimals for financial calculations, never floats.
The problem is that 0.005f is actually 0.004999999888241291046142578125... so less than 0.005. That's the closest float value to 0.005. When you multiply that by 200, you end up with something less than 1.
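You can see that stored value yourself with a quick check (mine, not part of the answer): widening the float to double preserves its exact value, and the round-trip format prints enough digits.

Console.WriteLine(((double)0.005f).ToString("R"));  // 0.004999999888241291 (a few more digits on some runtimes)
Console.WriteLine((double)0.005f * 200 < 1.0);      // True: the exact product falls short of 1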
If you use decimal instead - all the time, not converting from float - you should be fine in this particular scenario. So:
decimal x = 0.005m;
decimal y = 200m;
decimal z = x * y;
Console.WriteLine(z == 1m); // True
However, don't assume that this means decimal has "infinite precision". It's still a floating point type with limited precision - it's just a floating decimal point type, so 0.005 is exactly representable.
If you cannot tolerate any floating point precision issues, use decimal.
http://msdn.microsoft.com/en-us/library/364x0z75.aspx
Ultimately even decimal has precision issues (it allows for 28-29 significant digits). If you are working within its supported range ((-7.9 x 10^28 to 7.9 x 10^28) / 10^(0 to 28)), you are quite unlikely to be impacted by them.
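The classic one-third example shows those decimal limits in two lines (a quick sketch):

decimal third = 1m / 3m;        // 0.3333333333333333333333333333 (28 threes)
Console.WriteLine(third * 3m);  // 0.9999999999999999999999999999, not 1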
This question is about the threshold at which Math.Floor(double) and Math.Ceiling(double) decide to give you the previous or next integer value. I was disturbed to find that the threshold seems to have nothing to do with Double.Epsilon, which is the smallest value that can be represented with a double. For example:
double x = 3.0;
Console.WriteLine( Math.Floor( x - Double.Epsilon ) ); // expected 2, got 3
Console.WriteLine( Math.Ceiling( x + Double.Epsilon) ); // expected 4, got 3
Even multiplying Double.Epsilon by a fair bit didn't do the trick:
Console.WriteLine( Math.Floor( x - Double.Epsilon*1000 ) ); // expected 2, got 3
Console.WriteLine( Math.Ceiling( x + Double.Epsilon*1000) ); // expected 4, got 3
With some experimentation, I was able to determine that the threshold is somewhere around 2.2E-16, which is very small, but VASTLY bigger than Double.Epsilon.
The reason this question came up is that I was trying to calculate the number of digits in a number with the formula var digits = Math.Floor( Math.Log( n, 10 ) ) + 1. This formula doesn't work for n=1000 (which I stumbled on completely by accident) because Math.Log( 1000, 10 ) returns a number that's 4.44E-16 off its actual value. (I later found that the built-in Math.Log10(double) provides much more accurate results.)
Shouldn't the threshold be tied to Double.Epsilon or, if not, shouldn't the threshold be documented (I couldn't find any mention of this in the official MSDN documentation)?
Shouldn't the threshold be tied to Double.Epsilon
No.
The representable doubles are not uniformly distributed over the real numbers. Close to zero there are many representable values. But the further from zero you get, the further apart representable doubles are. For very large numbers even adding 1 to a double will not give you a new value.
Therefore the threshold you are looking for depends on how large your number is. It is not a constant.
The value of Double.Epsilon is 4.94065645841247e-324. Adding or subtracting this value to 3 results in 3, due to the way floating-point works.
A double has 53 bits of mantissa, so the smallest value you can add that will have any impact will be approximately 2^53 times smaller than your variable. So something around 1e-16 sounds about right (order of magnitude).
So to answer your question: there is no "threshold"; floor and ceil simply act on their argument in exactly the way you would expect.
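A two-line check of the absorption effect described above (my sketch; 2^-51 is the spacing of doubles in [2, 4)):

double x = 3.0;
double step = 4.440892098500626e-16;         // 2^-51: one ulp at x = 3.0
Console.WriteLine(x + double.Epsilon == x);  // True: Epsilon is absorbed entirely
Console.WriteLine(x + step == x);            // False: a full ulp reaches the next double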
This is going to be hand-waving rather than references to specifications, but I hope my "intuitive explanation" suits you well.
Epsilon represents the smallest magnitude that can be represented that is different from zero. Considering the mantissa and exponent of a double, that's going to be extremely tiny -- think 10^-324. There are over three hundred zeros between the decimal point and the first non-zero digit.
However, a Double represents roughly 15-16 digits of precision. That still leaves about 310 digits of zeros between Epsilon and integers.
Doubles are fixed to a certain bit length. If you really want arbitrary precision calculations, you should use an arbitrary-precision library instead. And be prepared for it to be significantly slower -- representing all 325 digits that would be necessary to store a number such as 2+epsilon will require roughly 17 times more storage per number (around 1080 bits instead of 64). That storage isn't free and calculating with it certainly cannot go at full CPU speed.
I'm messing around with Fourier transformations. Now I've created a class that does an implementation of the DFT (not doing anything like FFT atm). This is the implementation I've used:
public static Complex[] Dft(double[] data)
{
    int length = data.Length;
    Complex[] result = new Complex[length];
    for (int k = 1; k <= length; k++)
    {
        Complex c = Complex.Zero;
        for (int n = 1; n <= length; n++)
        {
            c += Complex.FromPolarCoordinates(data[n - 1], (-2 * Math.PI * n * k) / length);
        }
        result[k - 1] = 1 / Math.Sqrt(length) * c;
    }
    return result;
}
And these are the results I get from Dft({2,3,4})
Well, it seems pretty okay, since those are the values I expect. There is only one thing I find confusing, and it all has to do with the rounding of doubles.
First of all, why are the first two numbers not exactly the same (0.8660..4438 vs 0.8660..443)? And why can't it calculate a zero where you'd expect one? I know 2.8E-15 is pretty close to zero, but it's not zero.
Does anyone know how these marginal errors occur, and whether I can (and should) do something about them?
It might seem that there's no real problem, because these are just small errors. However, how do you deal with these rounding errors if you are, for example, comparing two values?
5.2 + 0i != 5.1961524 + i*2.828107*10^-15
Cheers
I think you've already explained it to yourself - limited precision means limited precision. End of story.
If you want to clean up the results, you can do some rounding of your own to a more reasonable number of significant digits - then your zeros will show up where you want them.
To answer the question raised by your comment, don't try to compare floating point numbers directly - use a range:
if (Math.Abs(float1 - float2) < 0.001) {
// they're the same!
}
The comp.lang.c FAQ has a lot of questions & answers about floating point, which you might be interested in reading.
From http://support.microsoft.com/kb/125056
Emphasis mine.
There are many situations in which precision, rounding, and accuracy in floating-point calculations can work to generate results that are surprising to the programmer. There are four general rules that should be followed:
In a calculation involving both single and double precision, the result will not usually be any more accurate than single precision. If double precision is required, be certain all terms in the calculation, including constants, are specified in double precision.
Never assume that a simple numeric value is accurately represented in the computer. Most floating-point values can't be precisely represented as a finite binary value. For example .1 is .0001100110011... in binary (it repeats forever), so it can't be represented with complete accuracy on a computer using binary arithmetic, which includes all PCs.
Never assume that the result is accurate to the last decimal place. There are always small differences between the "true" answer and what can be calculated with the finite precision of any floating point processing unit.
Never compare two floating-point values to see if they are equal or not equal. This is a corollary to rule 3. There are almost always going to be small differences between numbers that "should" be equal. Instead, always check to see if the numbers are nearly equal. In other words, check to see if the difference between them is very small or insignificant.
Note that although I referenced a microsoft document, this is not a windows problem. It's a problem with using binary and is in the CPU itself.
And, as a second side note, I tend to use the Decimal datatype instead of double: See this related SO question: decimal vs double! - Which one should I use and when?
In C# you'll want to use the 'decimal' type, not double for accuracy with decimal points.
As to the 'why'... representing fractions in different base systems gives different answers. For example, 1/3 in a base 10 system is 0.33333 recurring, but in a base 3 system is 0.1.
The double is a binary value, at base 2. When converting to base 10 decimal you can expect to have these rounding errors.
The following code in C# (.Net 3.5 SP1) is an infinite loop on my machine:
for (float i = 0; i < float.MaxValue; i++) ;
It reached the number 16777216.0, and 16777216.0 + 1 evaluates to 16777216.0. Yet at this point: i + 1 != i.
This is some craziness.
I realize there is some inaccuracy in how floating point numbers are stored. And I've read that whole numbers greater than 2^24 cannot be exactly stored as a float.
Still, the code above should be valid in C# even if the number cannot be exactly represented.
Why does it not work?
You can get the same to happen for double but it takes a very long time. 9007199254740992.0 is the limit for double.
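A quick check of that double limit (my sketch):

double d = 9007199254740992.0;  // 2^53: from here on, doubles step by 2
Console.WriteLine(d + 1 == d);  // True: 2^53 + 1 is not representable, the tie rounds to even
Console.WriteLine(d + 2 == d);  // False: the next double up is d + 2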
Right, so the issue is that in order to add one to the float, it would have to become
16777217.0
It just so happens that this value lies just past a power-of-two boundary (2^24), where the spacing between representable floats grows to 2, so it cannot be represented exactly as a float. (The next highest value available is 16777218.0)
So, it rounds to the nearest representable float
16777216.0
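You can watch that rounding happen (a quick sketch of mine; the explicit casts force the results back to float precision even if the runtime uses wider intermediates):

float f = 16777216f;                      // 2^24: above this, floats step by 2
Console.WriteLine((float)(f + 1f) == f);  // True: 16777217 rounds back down
Console.WriteLine((float)(f + 2f) == f);  // False: 16777218 is representable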
Let me put it this way:
Since the precision "floats" with the magnitude of the number, you have to increment by a larger and larger amount to reach the next representable value.
EDIT:
Ok, this is a little bit difficult to explain, but try this:
float f = float.MaxValue;
f -= 1.0f;
Debug.Assert(f == float.MaxValue);
This will run just fine, because at that value, in order to represent a difference of 1.0f, you would need over 128 bits of precision. A float has only 32 bits.
EDIT2
By my calculations, at least 128 binary digits unsigned would be necessary:
log(3.40282347E+38) / log(2) ≈ 128
As a solution to your problem, you could loop using 128-bit numbers instead. However, this will take at least a decade to complete.
Imagine for example that a floating point number is represented by up to 2 significant decimal digits, plus an exponent: in that case, you could count from 0 to 99 exactly. The next would be 100, but because you can only have 2 significant digits that would be stored as "1.0 times 10 to the power of 2". Adding one to that would be ... what?
At best, it would be 101 as an intermediate result, which would actually be stored (via a rounding error which discards the insignificant 3rd digit) as "1.0 times 10 to the power of 2" again.
To understand what's going wrong you're going to have to read the IEEE standard on floating point
Let's examine the structure of a floating point number for a second:
A floating point number is broken into two parts (ok 3, but ignore the sign bit for a second).
You have an exponent and a mantissa, laid out like so:
seeeeeeeemmmmmmmm
Note: that is not accurate to the number of bits, but it gives you a general idea of what's happening.
To figure out what number you have we do the following calculation:
mmmmmm * 2^(eeeeee) * (-1)^s
So what is float.MaxValue going to be? Well you're going to have the largest possible mantissa and the largest possible exponent. Let's pretend this looks something like:
01111111111111111
In actuality we define NaN and ±Inf and a couple of other conventions, but ignore them for a second because they're not relevant to your question.
So, what happens when you have 9.9999*2^99 + 1? Well, you do not have enough significant figures to add 1. As a result it gets rounded down to the same number. In the case of single floating point precision the point at which +1 starts to get rounded down happens to be 16777216.0
It has nothing to do with overflow, or being near the max value. The float value for 16777216.0 has a binary representation of exactly 16777216. You then increment it by 1, so it should be 16777217.0, except that the nearest float to 16777217.0 is 16777216! So it doesn't actually get incremented, or at least the increment doesn't do what you expect.
Here is a class written by Jon Skeet that illustrates this:
DoubleConverter.cs
Try this code with it:
double d1 = 16777217.0;
Console.WriteLine(DoubleConverter.ToExactString(d1));
float f1 = 16777216.0f;
Console.WriteLine(DoubleConverter.ToExactString(f1));
float f2 = 16777217.0f;
Console.WriteLine(DoubleConverter.ToExactString(f2));
Notice how the internal representation of 16777216.0f is the same as that of 16777217.0f!
The iteration never actually approaches float.MaxValue: once i reaches 16777216.0, adding 1 no longer changes the stored value, so i stops growing and the condition i < float.MaxValue remains true forever.