Given a floating-point number, I would like to separate it into a sum of parts, each with a given number of bits. For example, given 3.1415926535 and told to separate it into base-10 parts of 4 digits each, it would return 3.141 + 5.926E-4 + 5.350E-8. Actually, I want to separate a double (which has 52 bits of precision) into three parts with 18 bits of precision each, but it was easier to explain with a base-10 example. I am not necessarily averse to tricks that use the internal representation of a standard double-precision IEEE float, but I would really prefer a solution that stays purely in the floating point realm so as to avoid any issues with endian-dependency or non-standard floating point representations.
No, this is not a homework problem, and, yes, this has a practical use. If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type. Starting from this kind of decomposition, then multiplying all the parts and convolving, is one way to do that. Yes, I could also use an arbitrary-precision floating-point library, but this approach is likely to be faster when only a few parts are involved, and it will definitely be lighter-weight.
If you want to ensure that floating point multiplications are exact, you need to make sure that any two numbers you multiply will never have more than half the digits that you have space for in your floating point type.
Exactly. This technique can be found in Veltkamp/Dekker multiplication. While accessing the bits of the representation as in other answers is a possibility, you can also do with only floating-point operations. There is one instance in this blog post. The part you are interested in is:
Input: f; coef is 1 + 2^N
p = f * coef;
q = f - p;
h = p + q; // h contains the 53-N highest bits of f
l = f - h; // l contains the N lowest bits of f
*, -, and + must be exactly the IEEE 754 operations at the precision of f for this to work. On Intel architectures, these operations are provided by the SSE2 instruction set. Visual C sets the precision of the historical FPU to 53 bits in the prelude of the C programs it compiles, which also helps.
The c way of decomposing numbers would be abs and frexp, which remove sign and exponent. The result necessarily lies in [ 0.5 , 1.0 ). Multiplying that by 1<<N means the integer part (obtained by modf) contains the top N bits.
You can use BitConverter.DoubleToInt64Bits and C#'s bitwise operators. You seem to be familiar with IEEE floating point formats so I'll not add more detail.
I just noticed tag C. In this case, you can use a union and do pretty much the same.
The real problems you have are:
Handling the implicit leading "1". In border cases, this would lead you to +0 / -0 situations. I can predict your code will be full of special cases because of this reason.
With very low exponents, your will get them out of range even before you consider the "leading 1" problem. Even if in-range you will need to resort to subnormals. Given the big gap between normal and subnormal numbers, I also dare to predict that there will be several ranges of valid floating point numbers that will have no possible representation in this scheme.
Except as noted above, handling of the exponent should be trivial: subtract 18 and 36 for the second and third 18-bit parts (and then find the leading 1, further decreasing it, of course).
Ugly solution? IEEE 754 is ugly by itself in the border cases. Big-endian/little-endian is the least of your problems.
Personally, I think this will get too complicated for your original objective. Just stick to a simple solution to your problem: find a function that counts trailing zeroes (does the standard itself defines one? I could be confusing with a libtrary) and ensure that the sum is > 52. Yes, your requirement of "half the digits(?)" (you meant 26 bits, right?) is stronger than necessary. And also wrong because it doesn't take into account the implicit 1. This is also why above I didn't say >= 52, but > 52.
Hope this helps.
Numerically, in general, you can shift left n digits, convert to integer and subtract.
a = (3.1415926535)*1000 = 3141.5926535
b = (int) a = 3141
c = a - (double) b = 0.5926535 << can convert this to 0.5926, etc.
d = (double) b / 1000 = 3.141 << except this MIGHT NOT be exact in base 2!!
But the principal is the same if you do all the mults/divides by powers of 2.
Related
Why does the following program print what it prints?
class Program
{
static void Main(string[] args)
{
float f1 = 0.09f*100f;
float f2 = 0.09f*99.999999f;
Console.WriteLine(f1 > f2);
}
}
Output is
false
Floating point only has so many digits of precision. If you're seeing f1 == f2, it is because any difference requires more precision than a 32-bit float can represent.
I recommend reading What Every Computer Scientist Should Read About Floating Point
The main thing is this isn't just .Net: it's a limitation of the underlying system most every language will use to represent a float in memory. The precision only goes so far.
You can also have some fun with relatively simple numbers, when you take into account that it's not even base ten. 0.1 (1/10th), for example, is a repeating decimal when represented in binary, just as 1/3rd is when represented in decimal.
In this particular case, it’s because .09 and .999999 cannot be represented with exact precision in binary (similarly, 1/3 cannot be represented with exact precision in decimal). For example, 0.111111111111111111101111 base 2 is 0.999998986721038818359375 base 10. Adding 1 to the previous binary value, 0.11111111111111111111 base 2 is 0.99999904632568359375 base 10. There isn’t a binary value for exactly 0.999999. Floating point precision is also limited by the space allocated for storing the exponent and the fractional part of the mantissa. Also, like integer types, floating point can overflow its range, although its range is larger than integer ranges.
Running this bit of C++ code in the Xcode debugger,
float myFloat = 0.1;
shows that myFloat gets the value 0.100000001. It is off by 0.000000001. Not a lot, but if the computation has several arithmetic operations, the imprecision can be compounded.
imho a very good explanation of floating point is in Chapter 14 of Introduction to Computer Organization with x86-64 Assembly Language & GNU/Linux by Bob Plantz of California State University at Sonoma (retired) http://bob.cs.sonoma.edu/getting_book.html. The following is based on that chapter.
Floating point is like scientific notation, where a value is stored as a mixed number greater than or equal to 1.0 and less than 2.0 (the mantissa), times another number to some power (the exponent). Floating point uses base 2 rather than base 10, but in the simple model Plantz gives, he uses base 10 for clarity’s sake. Imagine a system where two positions of storage are used for the mantissa, one position is used for the sign of the exponent* (0 representing + and 1 representing -), and one position is used for the exponent. Now add 0.93 and 0.91. The answer is 1.8, not 1.84.
9311 represents 0.93, or 9.3 times 10 to the -1.
9111 represents 0.91, or 9.1 times 10 to the -1.
The exact answer is 1.84, or 1.84 times 10 to the 0, which would be 18400 if we had 5 positions, but, having only four positions, the answer is 1800, or 1.8 times 10 to the zero, or 1.8. Of course, floating point data types can use more than four positions of storage, but the number of positions is still limited.
Not only is precision limited by space, but “an exact representation of fractional values in binary is limited to sums of inverse powers of two.” (Plantz, op. cit.).
0.11100110 (binary) = 0.89843750 (decimal)
0.11100111 (binary) = 0.90234375 (decimal)
There is no exact representation of 0.9 decimal in binary. Even carrying the fraction out more places doesn’t work, as you get into repeating 1100 forever on the right.
Beginning programmers often see floating point arithmetic as more
accurate than integer. It is true that even adding two very large
integers can cause overflow. Multiplication makes it even more likely
that the result will be very large and, thus, overflow. And when used
with two integers, the / operator in C/C++ causes the fractional part
to be lost. However, ... floating point representations have their own
set of inaccuracies. (Plantz, op. cit.)
*In floating point, both the sign of the number and the sign of the exponent are represented.
The Double data type cannot correctly represent some base 10 values. This is because of how floating point numbers represent real numbers. What this means is that when representing monetary values, one should use the decimal value type to prevent errors. (feel free to correct errors in this preamble)
What I want to know is what are the values which present such a problem under the Double data-type under a 64 bit architecture in the standard .Net framework (C# if that makes a difference) ?
I expect the answer the be a formula or rule to find such values but I would also like some example values.
Any number which cannot be written as the sum of positive and negative powers of 2 cannot be exactly represented as a binary floating-point number.
The common IEEE formats for 32- and 64-bit representations of floating-point numbers impose further constraints; they limit the number of binary digits in both the significand and the exponent. So there are maximum and minimum representable numbers (approximately +/- 10^308 (base-10) if memory serves) and limits to the precision of a number that can be represented. This limit on the precision means that, for 64-bit numbers, the difference between the exponent of the largest power of 2 and the smallest power in a number is limited to 52, so if your number includes a term in 2^52 it can't also include a term in 2^-1.
Simple examples of numbers which cannot be exactly represented in binary floating-point numbers include 1/3, 2/3, 1/5.
Since the set of floating-point numbers (in any representation) is finite, and the set of real numbers is infinite, one algorithm to find a real number which is not exactly representable as a floating-point number is to select a real number at random. The probability that the real number is exactly representable as a floating-point number is 0.
You generally need to be prepared for the possibility that any value you store in a double has some small amount of error. Unless you're storing a constant value, chances are it could be something with at least some error. If it's imperative that there never be any error, and the values aren't constant, you probably shouldn't be using a floating point type.
What you probably should be asking in many cases is, "How do I deal with the minor floating point errors?" You'll want to know what types of operations can result in a lot of error, and what types don't. You'll want to ensure that comparing two values for "equality" actually just ensures they are "close enough" rather than exactly equal, etc.
This question actually goes beyond any single programming language or platform. The inaccuracy is actually inherent in binary data.
Consider that with a double, each number N to the left (at 0-based index I) of the decimal point represents the value N * 2^I and every digit to the right of the decimal point represents the value N * 2^(-I).
As an example, 5.625 (base 10) would be 101.101 (base 2).
Given this calculation, and decimal value that can't be calculated as a sum of 2^(-I) for different values of I would have an incorrect value as a double.
A float is represented as s, e and m in the following formula
s * m * 2^e
This means that any number that cannot be represented using the given expression (and in the respective domains of s, e and m) cannot be represented exactly.
Basically, you can represent all numbers between 0 and 2^53 - 1 multiplied by a certain power of two (possibly a negative power).
As an example, all numbers between 0 and 2^53 - 1 can be represented multiplied with 2^0 = 1. And you can also represent all those numbers by dividing them by 2 (with a .5 fraction). And so on.
This answer does not fully cover the topic, but I hope it helps.
I'm messing around with Fourier transformations. Now I've created a class that does an implementation of the DFT (not doing anything like FFT atm). This is the implementation I've used:
public static Complex[] Dft(double[] data)
{
int length = data.Length;
Complex[] result = new Complex[length];
for (int k = 1; k <= length; k++)
{
Complex c = Complex.Zero;
for (int n = 1; n <= length; n++)
{
c += Complex.FromPolarCoordinates(data[n-1], (-2 * Math.PI * n * k) / length);
}
result[k-1] = 1 / Math.Sqrt(length) * c;
}
return result;
}
And these are the results I get from Dft({2,3,4})
Well it seems pretty okay, since those are the values I expect. There is only one thing I find confusing. And it all has to do with the rounding of doubles.
First of all, why are the first two numbers not exactly the same (0,8660..443 8 ) vs (0,8660..443). And why can't it calculate a zero, where you'd expect it. I know 2.8E-15 is pretty close to zero, but well it's not.
Anyone know how these, marginal, errors occur and if I can and want to do something about it.
It might seem that there's not a real problem, because it's just small errors. However, how do you deal with these rounding errors if you're for example comparing 2 values.
5,2 + 0i != 5,1961524 + i2.828107*10^-15
Cheers
I think you've already explained it to yourself - limited precision means limited precision. End of story.
If you want to clean up the results, you can do some rounding of your own to a more reasonable number of siginificant digits - then your zeros will show up where you want them.
To answer the question raised by your comment, don't try to compare floating point numbers directly - use a range:
if (Math.Abs(float1 - float2) < 0.001) {
// they're the same!
}
The comp.lang.c FAQ has a lot of questions & answers about floating point, which you might be interested in reading.
From http://support.microsoft.com/kb/125056
Emphasis mine.
There are many situations in which precision, rounding, and accuracy in floating-point calculations can work to generate results that are surprising to the programmer. There are four general rules that should be followed:
In a calculation involving both single and double precision, the result will not usually be any more accurate than single precision. If double precision is required, be certain all terms in the calculation, including constants, are specified in double precision.
Never assume that a simple numeric value is accurately represented in the computer. Most floating-point values can't be precisely represented as a finite binary value. For example .1 is .0001100110011... in binary (it repeats forever), so it can't be represented with complete accuracy on a computer using binary arithmetic, which includes all PCs.
Never assume that the result is accurate to the last decimal place. There are always small differences between the "true" answer and what can be calculated with the finite precision of any floating point processing unit.
Never compare two floating-point values to see if they are equal or not- equal. This is a corollary to rule 3. There are almost always going to be small differences between numbers that "should" be equal. Instead, always check to see if the numbers are nearly equal. In other words, check to see if the difference between them is very small or insignificant.
Note that although I referenced a microsoft document, this is not a windows problem. It's a problem with using binary and is in the CPU itself.
And, as a second side note, I tend to use the Decimal datatype instead of double: See this related SO question: decimal vs double! - Which one should I use and when?
In C# you'll want to use the 'decimal' type, not double for accuracy with decimal points.
As to the 'why'... repsensenting fractions in different base systems gives different answers. For example 1/3 in a base 10 system is 0.33333 recurring, but in a base 3 system is 0.1.
The double is a binary value, at base 2. When converting to base 10 decimal you can expect to have these rounding errors.
All the methods in System.Math takes double as parameters and returns parameters. The constants are also of type double. I checked out MathNet.Numerics, and the same seems to be the case there.
Why is this? Especially for constants. Isn't decimal supposed to be more exact? Wouldn't that often be kind of useful when doing calculations?
This is a classic speed-versus-accuracy trade off.
However, keep in mind that for PI, for example, the most digits you will ever need is 41.
The largest number of digits of pi
that you will ever need is 41. To
compute the circumference of the
universe with an error less than the
diameter of a proton, you need 41
digits of pi †. It seems safe to
conclude that 41 digits is sufficient
accuracy in pi for any circle
measurement problem you're likely to
encounter. Thus, in the over one
trillion digits of pi computed in
2002, all digits beyond the 41st have
no practical value.
In addition, decimal and double have a slightly different internal storage structure. Decimals are designed to store base 10 data, where as doubles (and floats), are made to hold binary data. On a binary machine (like every computer in existence) a double will have fewer wasted bits when storing any number within its range.
Also consider:
System.Double 8 bytes Approximately ±5.0e-324 to ±1.7e308 with 15 or 16 significant figures
System.Decimal 12 bytes Approximately ±1.0e-28 to ±7.9e28 with 28 or 29 significant figures
As you can see, decimal has a smaller range, but a higher precision.
No, - decimals are no more "exact" than doubles, or for that matter, any type. The concept of "exactness", (when speaking about numerical representations in a compuiter), is what is wrong. Any type is absolutely 100% exact at representing some numbers. unsigned bytes are 100% exact at representing the whole numbers from 0 to 255. but they're no good for fractions or for negatives or integers outside the range.
Decimals are 100% exact at representing a certain set of base 10 values. doubles (since they store their value using binary IEEE exponential representation) are exact at representing a set of binary numbers.
Neither is any more exact than than the other in general, they are simply for different purposes.
To elaborate a bit furthur, since I seem to not be clear enough for some readers...
If you take every number which is representable as a decimal, and mark every one of them on a number line, between every adjacent pair of them there is an additional infinity of real numbers which are not representable as a decimal. The exact same statement can be made about the numbers which can be represented as a double. If you marked every decimal on the number line in blue, and every double in red, except for the integers, there would be very few places where the same value was marked in both colors.
In general, for 99.99999 % of the marks, (please don't nitpick my percentage) the blue set (decimals) is a completely different set of numbers from the red set (the doubles).
This is because by our very definition for the blue set is that it is a base 10 mantissa/exponent representation, and a double is a base 2 mantissa/exponent representation. Any value represented as base 2 mantissa and exponent, (1.00110101001 x 2 ^ (-11101001101001) means take the mantissa value (1.00110101001) and multiply it by 2 raised to the power of the exponent (when exponent is negative this is equivilent to dividing by 2 to the power of the absolute value of the exponent). This means that where the exponent is negative, (or where any portion of the mantissa is a fractional binary) the number cannot be represented as a decimal mantissa and exponent, and vice versa.
For any arbitrary real number, that falls randomly on the real number line, it will either be closer to one of the blue decimals, or to one of the red doubles.
Decimal is more precise but has less of a range. You would generally use Double for physics and mathematical calculations but you would use Decimal for financial and monetary calculations.
See the following articles on msdn for details.
Double
http://msdn.microsoft.com/en-us/library/678hzkk9.aspx
Decimal
http://msdn.microsoft.com/en-us/library/364x0z75.aspx
Seems like most of the arguments here to "It does not do what I want" are "but it's faster", well so is ANSI C+Gmp library, but nobody is advocating that right?
If you particularly want to control accuracy, then there are other languages which have taken the time to implement exact precision, in a user controllable way:
http://www.doughellmann.com/PyMOTW/decimal/
If precision is really important to you, then you are probably better off using languages that mathematicians would use. If you do not like Fortran then Python is a modern alternative.
Whatever language you are working in, remember the golden rule:
Avoid mixing types...
So do convert a and b to be the same before you attempt a operator b
If I were to hazard a guess, I'd say those functions leverage low-level math functionality (perhaps in C) that does not use decimals internally, and so returning a decimal would require a cast from double to decimal anyway. Besides, the purpose of the decimal value type is to ensure accuracy; these functions do not and cannot return 100% accurate results without infinite precision (e.g., irrational numbers).
Neither Decimal nor float or double are good enough if you require something to be precise. Furthermore, Decimal is so expensive and overused out there it is becoming a regular joke.
If you work in fractions and require ultimate precision, use fractions. It's same old rule, convert once and only when necessary. Your rounding rules too will vary per app, domain and so on, but sure you can find an odd example or two where it is suitable. But again, if you want fractions and ultimate precision, the answer is not to use anything but fractions. Consider you might want a feature of arbitrary precision as well.
The actual problem with CLR in general is that it is so odd and plain broken to implement a library that deals with numerics in generic fashion largely due to bad primitive design and shortcoming of the most popular compiler for the platform. It's almost the same as with Java fiasco.
double just turns out to be the best compromise covering most domains, and it works well, despite the fact MS JIT is still incapable of utilising a CPU tech that is about 15 years old now.
[piece to users of MSDN slowdown compilers]
Double is a built-in type. Is is supported by FPU/SSE core (formerly known as "Math coprocessor"), that's why it is blazingly fast. Especially at multiplication and scientific functions.
Decimal is actually a complex structure, consisting of several integers.
Why does the following program print what it prints?
class Program
{
static void Main(string[] args)
{
float f1 = 0.09f*100f;
float f2 = 0.09f*99.999999f;
Console.WriteLine(f1 > f2);
}
}
Output is
false
Floating point only has so many digits of precision. If you're seeing f1 == f2, it is because any difference requires more precision than a 32-bit float can represent.
I recommend reading What Every Computer Scientist Should Read About Floating Point
The main thing is this isn't just .Net: it's a limitation of the underlying system most every language will use to represent a float in memory. The precision only goes so far.
You can also have some fun with relatively simple numbers, when you take into account that it's not even base ten. 0.1 (1/10th), for example, is a repeating decimal when represented in binary, just as 1/3rd is when represented in decimal.
In this particular case, it’s because .09 and .999999 cannot be represented with exact precision in binary (similarly, 1/3 cannot be represented with exact precision in decimal). For example, 0.111111111111111111101111 base 2 is 0.999998986721038818359375 base 10. Adding 1 to the previous binary value, 0.11111111111111111111 base 2 is 0.99999904632568359375 base 10. There isn’t a binary value for exactly 0.999999. Floating point precision is also limited by the space allocated for storing the exponent and the fractional part of the mantissa. Also, like integer types, floating point can overflow its range, although its range is larger than integer ranges.
Running this bit of C++ code in the Xcode debugger,
float myFloat = 0.1;
shows that myFloat gets the value 0.100000001. It is off by 0.000000001. Not a lot, but if the computation has several arithmetic operations, the imprecision can be compounded.
imho a very good explanation of floating point is in Chapter 14 of Introduction to Computer Organization with x86-64 Assembly Language & GNU/Linux by Bob Plantz of California State University at Sonoma (retired) http://bob.cs.sonoma.edu/getting_book.html. The following is based on that chapter.
Floating point is like scientific notation, where a value is stored as a mixed number greater than or equal to 1.0 and less than 2.0 (the mantissa), times another number to some power (the exponent). Floating point uses base 2 rather than base 10, but in the simple model Plantz gives, he uses base 10 for clarity’s sake. Imagine a system where two positions of storage are used for the mantissa, one position is used for the sign of the exponent* (0 representing + and 1 representing -), and one position is used for the exponent. Now add 0.93 and 0.91. The answer is 1.8, not 1.84.
9311 represents 0.93, or 9.3 times 10 to the -1.
9111 represents 0.91, or 9.1 times 10 to the -1.
The exact answer is 1.84, or 1.84 times 10 to the 0, which would be 18400 if we had 5 positions, but, having only four positions, the answer is 1800, or 1.8 times 10 to the zero, or 1.8. Of course, floating point data types can use more than four positions of storage, but the number of positions is still limited.
Not only is precision limited by space, but “an exact representation of fractional values in binary is limited to sums of inverse powers of two.” (Plantz, op. cit.).
0.11100110 (binary) = 0.89843750 (decimal)
0.11100111 (binary) = 0.90234375 (decimal)
There is no exact representation of 0.9 decimal in binary. Even carrying the fraction out more places doesn’t work, as you get into repeating 1100 forever on the right.
Beginning programmers often see floating point arithmetic as more
accurate than integer. It is true that even adding two very large
integers can cause overflow. Multiplication makes it even more likely
that the result will be very large and, thus, overflow. And when used
with two integers, the / operator in C/C++ causes the fractional part
to be lost. However, ... floating point representations have their own
set of inaccuracies. (Plantz, op. cit.)
*In floating point, both the sign of the number and the sign of the exponent are represented.