Wasted exponent bit in C# double representation - c#

I've been doing research on floating-point doubles in .NET lately. While reading Jon Skeet's article Binary floating points and .NET, I had a question.
Let's start with the example of 46.428292315077 in the article.
Represented as a 64 bit double, this equates to the following bits:
Sign Exponent Mantissa
0 10000000100 0111001101101101001001001000010101110011000100100011
One bit is used to represent the sign, 11 bits are used to represent the exponent, and 52 bits are used to represent the mantissa. Note the bias of 1023 for doubles (which I assume is to allow for negative exponents - more on this later).
My confusion is with the 11 bits which represent the exponent, and their use (or lack thereof) for large numbers, specifically double.MaxValue (1.7976931348623157E+308).
For the exponent, there are a few special values as cited in the article which help determine a number's value. All zeroes represent 0; all ones represent NaN and positive/negative infinity. There are 11 bits to work with: the first bit of the exponent is bias, so we can disregard that. This gives us 10 bits which control the actual size of the exponent.
The exponent on double.MaxValue is 308, which can be represented with 9 bits (100110100, or with bias: 10100110100). The smallest fractional value is double.Epsilon (4.94065645841247E-324), and its exponent can still be represented in 9 bits (101000100, or with bias: 00101000100).
You might notice that the first bit after the bias always seems to be wasted. Are my assumptions about negative exponents correct? If so, why is the second bit after the bias wasted? Regardless, it seems like the actual largest number we could represent (while respecting the special values and a possible sign bit after the bias) is 111111111 (or 511 in base 10).
If the bit after the bias is actually wasted, why can't we represent numbers with exponents larger than 324? What am I misunderstanding about this?

There are no wasted bits in a double.
Let's sort out your confusion. How do we turn a double from bits into a mathematical value? Let's assume the double is not zero, infinity, negative infinity, NaN or a denormal, because those all have special rules.
The crux of your confusion is mixing up decimal quantities with binary quantities. For this answer I'll write binary quantities as bare bit strings and decimal quantities in ordinary notation.
We take the 52 bits of the mantissa and put them after a leading "1." So in your example, that would be
1.0111001101101101001001001000010101110011000100100011
That's a binary number. So 1 + 0/2 + 1/4 + 1/8 + 1/16 + 0/32 ...
Then we take the 11 bits of the exponent, treat that as an 11 bit unsigned integer, and subtract 1023 from that value. So in your example we have 10000000100 which is the unsigned integer 1028. Subtract 1023, and we get 5.
Now we shift the "decimal place" (ha ha) by 5 places:
101110.01101101101001001001000010101110011000100100011
Note that this is equivalent to multiplying by 2^5. It is not multiplying by 10^5!
And now we multiply the whole thing by 1 if the sign bit is 0, and -1 if the sign bit is 1. So the final answer is
101110.01101101101001001001000010101110011000100100011
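(As an aside, here's a minimal sketch of my own, not the article's code, showing how you could pull those three fields out of a double yourself; the bit masks follow the IEEE 754 binary64 layout described above.)

using System;

double d = 46.428292315077;
long bits = BitConverter.DoubleToInt64Bits(d);

int sign        = (int)((bits >> 63) & 1);         // 0: the number is positive
int rawExponent = (int)((bits >> 52) & 0x7FF);     // 1028, i.e. 10000000100
long mantissa   = bits & 0xFFFFFFFFFFFFF;          // the 52 stored mantissa bits

Console.WriteLine(Convert.ToString(rawExponent, 2));               // 10000000100
Console.WriteLine(rawExponent - 1023);                             // 5, the unbiased exponent
Console.WriteLine(Convert.ToString(mantissa, 2).PadLeft(52, '0')); // the mantissa bits from the question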
Let's see an example with a negative exponent.
Suppose the exponent had been 01111111100. That's 1020 as an unsigned integer. Subtract 1023. We get -3, so we would shift three places to the left, and get:
0.0010111001101101101001001001000010101110011000100100011
Let's see an example with a large exponent. What if the exponent had been 11111111100 ?
Work it out. That's 2044 in decimal. Subtract 1023. That's 1021. So this number would be the extremely large number that you get when multiplying 1.0111001101101101001001001000010101110011000100100011 by 2^1021.
So the value of that double is exactly equal to
32603055608669827528875188998863283395233949199438288081243712122350844851941321466156747022359800582932574058697506453751658312301708309704448596122037141141297743099124156580613023692715652869864010740666615694378079258090383719888417882332809291228958035810952632190230935024250237637887765563383983636480
Which is approximately 3.26030556 x 10^307.
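(If you want to check that enormous integer yourself, here's a quick verification sketch of my own using System.Numerics.BigInteger; it is not part of the original answer.)

using System;
using System.Numerics;

// 53-bit significand: the implicit leading 1 followed by the 52 stored bits
ulong significand = Convert.ToUInt64("1" + "0111001101101101001001001000010101110011000100100011", 2);

// value = significand * 2^(1021 - 52), since the 52 fraction bits sit below the leading 1
BigInteger value = new BigInteger(significand) << (1021 - 52);
Console.WriteLine(value);   // should reproduce the integer printed above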
Is that now clear?
If this subject interests you, here's some further reading:
Code to decode a double into its parts:
https://ericlippert.com/2015/11/30/the-dedoublifier-part-one/
A simple arbitrary-precision rational:
https://ericlippert.com/2015/12/03/the-dedoublifier-part-two/
Code to turn a double into its exact rational:
https://ericlippert.com/2015/12/07/the-dedoublifier-part-three/
Representation of floats:
https://blogs.msdn.microsoft.com/ericlippert/2005/01/10/floating-point-arithmetic-part-one/
How Benford's Law is used to minimize representation errors:
https://blogs.msdn.microsoft.com/ericlippert/2005/01/13/floating-point-and-benfords-law-part-two/
What algorithm do we use to display floats as decimal quantities?
https://blogs.msdn.microsoft.com/ericlippert/2005/01/17/fun-with-floating-point-arithmetic-part-three/
What happens when you try to compare for equality floats of different precision levels?
https://blogs.msdn.microsoft.com/ericlippert/2005/01/18/fun-with-floating-point-arithmetic-part-four/
What properties of standard arithmetic fail to hold in floating point?
https://blogs.msdn.microsoft.com/ericlippert/2005/01/20/fun-with-floating-point-arithmetic-part-five/
How are infinities and divisions by zero represented?
https://blogs.msdn.microsoft.com/ericlippert/2009/10/15/as-timeless-as-infinity/

Related

Float and double - Significand numbers - Mantissa POV?

With single precision (32 bits), the bit layout leaves 23 bits for the mantissa/significand.
So we can represent 2^23 values (via 23 bits), which is 8388608, a 7-digit number.
BUT
I was reading that the mantissa is normalized (the leading digit in the mantissa will always be a 1), so the pattern is actually 1.mmm and only the mmm part is stored in the mantissa.
For example, 0.75 is what gets stored, but the value is actually 1.75.
Question #1
So basically it adds one more digit of precision, no?
If so, then we have 8 significant digits!
So why does MSDN say 7?
Question #2
In double there are 52 bits for the mantissa (bits 0..51).
If I add 1 for the normalized mantissa, that gives 2^53 possibilities, which is 9007199254740992 (16 digits),
and MS says 15-16.
Why this inconsistency? Am I missing something?
It doesn't add one more decimal digit - just a single binary digit. So instead of 23 bits, you have 24 bits. This is handy, because the only number you can't represent as starting with a one is zero, and that's a special value.
In short, you're not dealing with a base-10 quantity like 10^(-24) - you're dealing with 2^(-24), a binary one. That's the most important difference between float/double and decimal. decimal is what you imagine floats to be, ie. a simple exponent-shifted, base-10 number. float and double aren't that.
Now, decimal digits versus binary digits is a tricky matter. You're mistaken in thinking that the precision is simply the number of digits in the 2^24 figure - that would only be true if you were talking about e.g. the decimal type, which actually stores decimal values as decimal-point offsets of a normal (huge-ass) integer.
Just like 1/3 cannot be written in decimal (0.333333...), many simple decimal numbers can't be represented in a float precisely (0.2 is the typical example). decimal doesn't have a problem with that - it's just 2 shifted one digit to the right, easy peasy. For floats, however, you have to represent this value as a sum of negative powers of two: 0.5, 0.25, 0.125... The same problem would apply in the opposite direction if 2 weren't a factor of 10; as it is, every finite binary fraction can be represented with finite precision in decimal.
Now, in fact, float can easily represent a number with 24 decimal digits - it just has to be 2^(-24), a number you're not going to encounter in your usual day job, and a weird number in decimal. So where does the 7 (actually more like 7.22) come from? Simple: just take the decimal logarithm of 2^24.
The fact that it seems that 0.2 can be represented "exactly" in a float is simply because every time you e.g. convert it to a string, you're rounding. So even though the number isn't exactly 0.2, it ends up that way when you convert it to a decimal string.
All this means that when you need decimal precision, you want to use decimal, as simple as that. This is not because it's a better base for calculations, it's simply because humans use it, and they will not be happy if your application gives different results from what they calculate on a piece of paper - especially when dealing with money. Accountants are very focused on having everything correct to the least significant digit.
Floats are used where it's not about decimal precision, but rather about generally having some sort of precision - this makes them well suited for physics calculations and similar, because you don't actually care about having the number come up the same in decimal - you're working with a given precision, and you're going to get that - 24 significant binary "decimals".
The implied leading 1 adds one more binary digit of precision, not decimal.
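To put numbers on that, here's a quick sketch (mine, just for illustration): the count of decimal digits that an N-bit significand can carry is the base-10 logarithm of 2^N.

Console.WriteLine(24 * Math.Log10(2));   // ~7.22, the "7 digits" quoted for float
Console.WriteLine(53 * Math.Log10(2));   // ~15.95, the "15-16 digits" quoted for double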

Double vs Decimal Rounding in C#

Why does:
double dividend = 1.0;
double divisor = 3.0;
Console.WriteLine(dividend / divisor * divisor);
output 1.0,
but:
decimal dividend = 1;
decimal divisor = 3;
Console.WriteLine(dividend / divisor * divisor);
outputs 0.9999999999999999999999999999
?
I understand that 1/3 can't be computed exactly, so there must be some rounding.
But why does Double round the answer to 1.0, but Decimal does not?
Also, why does double compute 1.0/3.0 to be 0.33333333333333331?
If rounding is used, then wouldn't the last 3 get rounded to 0, why 1?
Why 1/3 as a double is 0.33333333333333331
The closest way to represent 1/3 in binary is like this:
0.0101010101...
That's the same as the series 1/4 + (1/4)^2 + (1/4)^3 + (1/4)^4...
Of course, this is limited by the number of bits you can store in a double. A double is 64 bits, but one of those is the sign bit and another 11 represent the exponent (think of it like scientific notation, but in binary). So the rest, which is called the mantissa or significand, is 52 bits. Assume a 1 to start and then use two bits for each subsequent power of 1/4. That means you can store:
1/4 + 1/4^2 + ... + 1/4 ^ 27
which is 0.33333333333333331
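(A quick check of my own that this 27-term sum really is the double in question; each term and each partial sum here is exactly representable, so the loop introduces no extra rounding.)

double sum = 0.0, term = 1.0;
for (int k = 1; k <= 27; k++) { term *= 0.25; sum += term; }
Console.WriteLine(sum.ToString("G17"));   // 0.33333333333333331
Console.WriteLine(sum == 1.0 / 3.0);      // True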
Why multiplying by 3 rounds this to 1
So 1/3 represented in binary and limited by the size of a double is:
0.010101010101010101010101010101010101010101010101010101
I'm not saying that's how it's stored. Like I said, you store the bits starting after the 1, and you use separate bits for the exponent and the sign. But I think it's useful to consider how you'd actually write it in base 2.
Let's stick with this "mathematician's binary" representation and ignore the size limits of a double. You don't have to do it this way, but I find it convenient. If we want to take this approximation for 1/3 and multiply by 3, that's the same as bit shifting to multiply by 2 and then adding what you started with. This gives us 1/3 * 3 = 0.111111111111111111111111111111111111111111111111111111
But can a double store that? No, remember, you can only have 52 bits of mantissa after the first 1, and that number has 54 ones. So we know that it'll be rounded, in this case rounded up to exactly 1.
Why for decimal you get 0.9999999999999999999999999999
With decimal, you get 96 bits to represent an integer, with additional bits representing the exponent up to 28 powers of 10. So even though ultimately it's all stored as binary, here we're working with powers of 10 so it makes sense to think of the number in base 10. 96 bits lets us express up to 79,228,162,514,264,337,593,543,950,335, but to represent 1/3 we're going to go with all 3's, up to the 28 of them that we can shift to the right of the decimal point: 0.3333333333333333333333333333.
Multiplying this approximation for 1/3 by 3 gives us a number we can represent exactly. It's just 28 9's, all shifted to the right of the decimal point: 0.9999999999999999999999999999. So unlike with doubles, there's no second round of rounding at this point.
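You can see both behaviours directly (a quick check of my own, not part of the original answer):

double d = 1.0 / 3.0;
Console.WriteLine(d.ToString("G17"));   // 0.33333333333333331, the stored approximation
Console.WriteLine(d * 3.0);             // 1, because the product rounds back up to exactly 1.0

decimal m = 1m / 3m;
Console.WriteLine(m);                   // 0.3333333333333333333333333333
Console.WriteLine(m * 3m);              // 0.9999999999999999999999999999, no second rounding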
This is by design of the decimal type, which is optimized for decimal accuracy, unlike the double type, which is optimized for range and performance at the cost of exact decimal representation.
The Decimal value type represents decimal numbers ranging from positive 79,228,162,514,264,337,593,543,950,335 to negative 79,228,162,514,264,337,593,543,950,335.
The Decimal value type is appropriate for financial calculations requiring large numbers of significant integral and fractional digits and no round-off errors. The Decimal type does not eliminate the need for rounding. Rather, it minimizes errors due to rounding. Thus your code produces a result of 0.9999999999999999999999999999 rather than 1.
One reason that infinite decimals are a necessary extension of finite decimals is to represent fractions. Using long division, a simple division of integers like 1⁄9 becomes a recurring decimal, 0.111…, in which the digits repeat without end. This decimal yields a quick proof for 0.999… = 1. Multiplication of 9 times 1 produces 9 in each digit, so 9 × 0.111… equals 0.999… and 9 × 1⁄9 equals 1, so 0.999… = 1.

Floating point operations ambiguity [duplicate]

Possible Duplicate:
Why is floating point arithmetic in C# imprecise?
Why is there a bias in floating point ops? Any specific reason?
Output:
160
139
static void Main()
{
    float x = (float) 1.6;
    int y = (int)(x * 100);
    float a = (float) 1.4;
    int b = (int)(a * 100);
    Console.WriteLine(y);
    Console.WriteLine(b);
    Console.ReadKey();
}
Any rational number that has a denominator that is not a power of 2 will lead to an infinite number of digits when represented as a binary. Here you have 8/5 and 7/5. Therefore there is no exact binary representation as a floating-point number (unless you have infinite memory).
The exact binary representation of 1.6 is 110011001100110011001100110011001100...
The exact binary representation of 1.4 is 101100110011001100110011001100110011...
Both values have an infinite number of digits (1100 is repeated endlessly).
float values have a precision of 24 bits. So the binary representation of any value will be rounded to 24 bits. If you round the given values to 24 bits you get:
1.6: 110011001100110011001101 (decimal 13421773) - rounded up
1.4: 101100110011001100110011 (decimal 11744051) - rounded down
Both values have an exponent of 0 (the first bit is 2^0 = 1, the second is 2^-1 = 0.5 etc.).
Since the first bit in a 24 bit value is 2^23 you can calculate the exact decimal values by dividing the 24 bit values (13421773 and 11744051) by two 23 times.
The values are: 1.60000002384185791015625 and 1.39999997615814208984375.
When using floating-point types you always have to consider that their precision is finite. Values that can be written exact as decimal values might be rounded up or down when represented as binaries. Casting to int does not respect that because it truncates the given values. You should always use something like Math.Round.
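For example (a small sketch of my own; note that the intermediate precision of the multiplication is implementation-dependent, so the truncated result can differ between runtimes):

float a = 1.4f;
Console.WriteLine(((double)a).ToString("G17"));   // 1.3999999761581421, the value actually stored
Console.WriteLine((int)(a * 100));                // typically 139: the cast truncates
Console.WriteLine((int)Math.Round(a * 100));      // 140: round to nearest first, then convert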
If you really need an exact representation of rational numbers you need a completely different approach. Since rational numbers are fractions you can use integers to represent them. Here is an example of how you can achieve that.
However, you can not write Rational x = (Rational)1.6 then. You have to write something like Rational x = new Rational(8, 5) (or new Rational(16, 10) etc.).
This is due to the fact that floating point arithmetic is not precise. When you set a to 1.4, internally it may not be exactly 1.4, just as close as can be made with machine precision. If it is fractionally less than 1.4, then multiplying by 100 and casting to integer will take only the integer portion which in this case would be 139. You will get far more technically precise answers but essentially this is what is happening.
In the case of your output for the 1.6 case, the floating point representation may actually be minutely larger than 1.6 and so when you multiply by 100, the total is slightly larger than 160 and so the integer cast gives you what you expect. The fact is that there is simply not enough precision available in a computer to store every real number exactly.
See this link for details of the conversion from floating point to integer types http://msdn.microsoft.com/en-us/library/aa691289%28v=vs.71%29.aspx - it has its own section.
The floating point types float (32 bit) and double (64 bit) have a limited precision and, moreover, the value is represented as a binary value internally. Just as you cannot represent 1/7 precisely in a decimal system (~ 0.1428571428571428...), 1/10 cannot be represented precisely in a binary system.
You can however use the decimal type. It still has a limited (however high) precision, but the numbers are represented in a decimal way internally. Therefore a value like 1/10 is represented internally exactly as 0.1000000000000000000000000000. 1/7 is still a problem for decimal. But at least you don't get a loss of precision by converting to binary and then back to decimal.
Consider using decimal.
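A quick illustration (mine, not the answerer's) of why decimal sidesteps the problem in the question:

Console.WriteLine((int)(1.6m * 100));   // 160
Console.WriteLine((int)(1.4m * 100));   // 140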

Precision of double after decimal point

In the lunch break we started debating about the precision of the double value type.
My colleague thinks, it always has 15 places after the decimal point.
In my opinion one can't tell, because IEEE 754 does not make assumptions
about this and it depends on where the first 1 is in the binary
representation. (i.e. the size of the number before the decimal point counts, too)
How can one make a more qualified statement?
As stated by the C# reference, the precision is from 15 to 16 digits (depending on the decimal values represented) before or after the decimal point.
In short, you are right, it depends on the values before and after the decimal point.
For example:
12345678.1234567D //Next digit to the right will get rounded up
1234567.12345678D //Next digit to the right will get rounded up
Full sample at: http://ideone.com/eXvz3
Also, trying to think about double values as fixed decimal values is not a good idea.
You're both wrong. A normal double has 53 bits of precision. That's roughly equivalent to 16 decimal digits, but thinking of double values as though they were decimals leads to no end of confusion, and is best avoided.
That said, you are much closer to correct than your colleague--the precision is relative to the value being represented; sufficiently large doubles have no fractional digits of precision.
For example, the next double larger than 4503599627370496.0 is 4503599627370497.0.
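You can see this with a short sketch (my own, using BitConverter to step to the adjacent representation):

double x = 4503599627370496.0;   // 2^52
double next = BitConverter.Int64BitsToDouble(BitConverter.DoubleToInt64Bits(x) + 1);
Console.WriteLine(next.ToString("F0"));   // 4503599627370497
Console.WriteLine(next - x);              // 1: no room left for fractional digits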
C# doubles are represented according to IEEE 754 with a 53 bit significand p (or mantissa; 52 bits are stored plus an implicit leading 1) and an 11 bit exponent e, whose unbiased value ranges between -1022 and 1023. Their value is therefore
p * 2^e
The significand always has one digit before the decimal point, so the precision of its fractional part is fixed. On the other hand the number of digits after the decimal point in a double depends also on its exponent; numbers whose exponent exceeds the number of digits in the fractional part of the significand do not have a fractional part themselves.
What Every Computer Scientist Should Know About Floating-Point Arithmetic is probably the most widely recognized publication on this subject.
Since this is the only question on SO that I could find on this topic, I would like to make an addition to jorgebg's answer.
According to this, precision is actually 15-17 digits. An example of a double with 17 digits of precision would be 0.92107099070578813 (don't ask me how I got that number :P)

Check if decimal contains decimal places by looking at the bytes

There is a similar question in here. Sometimes that solution gives exceptions because the numbers might be too large.
I think that if there is a way of looking at the bytes of a decimal number it would be more efficient. For example, an Int32 is represented by 32 bits, and all the numbers that start with a 1 bit are negative. Maybe there is some kind of similar relationship with decimal numbers. How can you look at the bytes of a decimal number? Or the bytes of an integer number?
If you are really talking about decimal numbers (as opposed to floating-point numbers), then Decimal.GetBits will let you look at the individual bits of a decimal. The MSDN page also contains a description of the meaning of the bits.
On the other hand, if you just want to check whether a number has a fractional part or not, doing a simple
var hasFractionalPart = (myValue - Math.Round(myValue) != 0)
is much easier than decoding the binary structure. This should work for decimals as well as classic floating-point data types such as float or double. In the latter case, due to floating-point rounding error, it might make sense to check for Math.Abs(myValue - Math.Round(myValue)) < someThreshold instead of comparing to 0.
If you want a reasonably efficient way of getting the 'decimal' value of a decimal type you can just mod it by one.
decimal number = 4.75M;
decimal fractionalPart = number % 1;
Console.WriteLine(fractionalPart); //will print 0.75
While it may not be the theoretically optimal solution, it'll be quite fast, and almost certainly fast enough for your purposes (far better than string manipulation and parsing, which is a common naive approach).
You can use Decimal.GetBits in order to retrieve the bits from a decimal structure.
The MSDN page linked above details how they are laid out in memory:
The binary representation of a Decimal number consists of a 1-bit sign, a 96-bit integer number, and a scaling factor used to divide the integer number and specify what portion of it is a decimal fraction. The scaling factor is implicitly the number 10, raised to an exponent ranging from 0 to 28.
The return value is a four-element array of 32-bit signed integers.
The first, second, and third elements of the returned array contain the low, middle, and high 32 bits of the 96-bit integer number.
The fourth element of the returned array contains the scale factor and sign. It consists of the following parts:
Bits 0 to 15, the lower word, are unused and must be zero.
Bits 16 to 23 must contain an exponent between 0 and 28, which indicates the power of 10 to divide the integer number.
Bits 24 to 30 are unused and must be zero.
Bit 31 contains the sign; 0 meaning positive, and 1 meaning negative.
Going with Oded's detailed info to use GetBits, I came up with this
const int EXP_MASK = 0x00FF0000;
bool hasDecimal = (Decimal.GetBits(value)[3] & EXP_MASK) != 0x0;
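One caveat worth noting (my own observation): this tests whether the stored scale is non-zero, not whether the value is actually integral. decimal keeps trailing zeros, so 1.0M has a scale of 1 and is reported as having decimals even though it is a whole number:

Console.WriteLine((Decimal.GetBits(1.0M)[3] & EXP_MASK) != 0);   // True, although 1.0M is a whole number
Console.WriteLine(1.0M % 1 == 0);                                // True: the modulo test from above handles this case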
