Dividing an int by zero will throw an exception, but dividing a float won't - at least in Java. Why does a float have additional NaN info, while an int type doesn't?
The representation of a float has been designed such that some special combinations of bits are reserved to store special values such as NaN, infinity, etc.
There are no unused representations for an int type - every bit pattern corresponds to an integer. This has many advantages:
The range of an integer type is as large as possible - no bit patterns are wasted.
The representation of an integer is easy to understand because there are no special cases.
Integer arithmetic can be done at extremely high speed even on very simple processors.
A clear explanation of float arithmetic is given here:
http://www.artima.com/underthehood/floatingP.html
I think the real reason, the root of this, is the well known fact: computers store everything in zeroes and ones.
What does it have to do with integers, floats and zero division? It's pretty simple. If you have only zeroes and ones, it is pretty easy to combine them into integer numbers, like you do with decimal digits. So "10" becomes two, "11" becomes three and so on. This kind of integer representation is so natural that no one would think of inventing anything else for integers; it would just make CPUs more complicated and things more confusing. The only "invention" that was required is to figure out how to store negative numbers, but that's also very natural and simple if you start from the point that x+(-x) should always be equal to zero, without using any special kind of addition here. That's why 11111111 is -1 for 8-bit integers: if you add 1 to it, it becomes 100000000, then the ninth bit is truncated due to overflow and you get your zero. But this natural format has no place for infinities and NaNs, and nobody wanted to invent a non-natural representation just for that. Well, I won't be surprised if someone actually did that, but there is no way such a format would become well-known and widely used.
Now, for floating-point numbers, there is no natural representation. Even if we translate 0.5 to binary, it would still be something like 0.1, only now we have a "binary point" instead of a decimal point. But CPUs can't naturally represent a "point", only 1 and 0. So some kind of special format was needed. There was simply no other way to go. And then someone probably suggested, "Hey guys, while we are at it, why not include a special representation for infinity and other numeric nonsense?" and so it was done.
This is the reason why these formats are so different. How to handle division by zero is up to language designers, but for floating-point numbers they have the choice between inf/NaN and exceptions, while for integers they don't naturally have that choice.
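The wrap-around described above can be checked directly. This is a minimal sketch in Java, using the 8-bit byte type; the class and method names are made up for illustration.

```java
class TwosComplementDemo {
    // Adds 1 to an 8-bit value; the cast back to byte discards any carry
    // out of the top bit, just as the hardware would.
    static byte incrementWithWrap(byte value) {
        return (byte) (value + 1);
    }

    public static void main(String[] args) {
        byte allOnes = (byte) 0b1111_1111;                      // the bit pattern 11111111
        System.out.println(allOnes);                            // -1
        System.out.println(incrementWithWrap(allOnes));         // 0
        System.out.println(incrementWithWrap(Byte.MAX_VALUE));  // -128
    }
}
```

Every one of the 256 bit patterns is already taken by a number, which is why there is nothing left over for a NaN.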
Basically, it's a purely arbitrary decision.
The traditional int uses all of its bit patterns to represent numbers, whereas the IEEE 754 standard reserves a range of bit patterns for special values such as NaN and the infinities.
The standard could be changed for ints to include special values, at the cost of less efficient operations. Developers usually expect int operations to be very efficient, whereas operations with floating point numbers are (purely psychologically) more allowed to be slower.
Ints and floats are represented differently inside the machine. Integers usually use a signed, two's complement representation that is (essentially) the number written out in base two. Floats, on the other hand, use a more complex representation that can hold much larger and much smaller values. However, the machine reserves several special bit patterns for floats to mean things other than numbers. There are values for NaN, and for positive or negative infinity, for example. This means that if you divide a float by zero, there is a series of bits that the computer can use to encode that you divided by zero. For ints, all bit patterns are used to encode numbers, so there's no meaningful series of bits the computer could use to represent the error.
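To see those reserved patterns, one can print the raw bits of a few float values. The sketch below uses Float.floatToRawIntBits from the Java standard library; the helper name bits is an invented one. In each special value the eight exponent bits (right after the sign bit) are all ones.

```java
class FloatBitPatterns {
    // Returns the 32-bit IEEE 754 layout of f as a string of 0s and 1s.
    static String bits(float f) {
        String s = Integer.toBinaryString(Float.floatToRawIntBits(f));
        return String.format("%32s", s).replace(' ', '0');  // left-pad to 32 digits
    }

    public static void main(String[] args) {
        System.out.println(bits(1.0f));                    // an ordinary number
        System.out.println(bits(Float.POSITIVE_INFINITY)); // exponent all ones, fraction zero
        System.out.println(bits(Float.NEGATIVE_INFINITY)); // same, with the sign bit set
        System.out.println(bits(Float.NaN));               // exponent all ones, fraction non-zero
    }
}
```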
This isn't an essential property of ints, though. One could, in theory, make an integer representation that handles division by zero by returning some NaN variant. It's just not what's done in practice.
Java reflects the way most CPUs are implemented. Integer divide by zero raises a hardware exception on x86/x64, while floating-point divide by zero results in Infinity, negative Infinity or NaN. Note: with floating point you can also divide by negative zero. :P
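A small Java sketch of the behaviour this thread is about (the class and helper names are invented for the example):

```java
class DivideByZeroDemo {
    // Integer division by zero throws; report that instead of crashing.
    static String describeIntDivision(int a, int b) {
        try {
            return String.valueOf(a / b);
        } catch (ArithmeticException e) {
            return "ArithmeticException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeIntDivision(6, 3));  // 2
        System.out.println(describeIntDivision(1, 0));  // an ArithmeticException is reported
        System.out.println(1.0 / 0.0);                  // Infinity
        System.out.println(-1.0 / 0.0);                 // -Infinity
        System.out.println(0.0 / 0.0);                  // NaN
        System.out.println(1.0 / -0.0);                 // -Infinity: the zero carries a sign
    }
}
```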
My team is working with financial software that exposes monetary values as C# floating point doubles. Occasionally, we need to compare these values to see if they equal zero, or fall under a particular limit. When I noticed unexpected behavior in this logic, I quickly learned about the rounding errors inherent in floating point doubles (e.g. 1.1 + 2.2 = 3.3000000000000003). Up until this point, I have primarily used C# decimals to represent monetary values.
My team decided to resolve this issue by using the epsilon value approach. Essentially, when you compare two numbers, if the difference between those two numbers is less than epsilon, they are considered equal. We implemented this approach in a similar way to the one described in the article below:
https://www.codeproject.com/Articles/383871/Demystify-Csharp-floating-point-equality-and-relat
Our challenge has been determining an appropriate value for epsilon. Our monetary values can have up to 3 digits to the right of the decimal point (scale = 3). This means that the largest epsilon we could use is .0001 (anything larger and the 3rd digit gets ignored). Since epsilon values are supposed to be small, we decided to move it out one more decimal place to .00001 (just to be safe, you could say). C# doubles have a precision of at least 15 digits, so I believe this value of epsilon should work if the number to the left of the decimal point has 10 digits or fewer (15 - 5 = 10, where 5 is the number of digits epsilon is to the right of the decimal point). With 10 digits, we can represent values into the billions, up to 9,999,999,999.999. It's possible that we may have numbers in the hundreds of millions, but we don't expect to go into the billions, so this limit should suffice.
Is my rationale for choosing this value of epsilon correct? I found a lot of resources that discuss this approach, but I couldn’t find many resources that provide guidance on choosing epsilon.
Your reasoning seems sound, but as you have already discovered it is a complicated issue. You might want to read What Every Computer Scientist Should Know About Floating-Point Arithmetic. You do have a minimum of 15 digits of precision using 64 bit doubles. However, you will also want to validate your inputs, as floats can contain NaN, +/- Infinity, negative zero and a considerably larger "range" than 15 decimal digits. If someone hands your library a value like 1.2E102, should you process it or consider it out of range? Ditto with very small values. Garbage in, garbage out, but it might be nice if your code detected the "smell" of garbage and at the very least logged it.
You might also want to consider providing a property for setting precision as well as different forms of rounding. That depends largely on the specifications you are working with. You might also want to determine if these values can represent currencies other than dollars (1 dollar is currently >112 yen).
The long and the short of it: choosing your epsilon a digit below your needs (so four digits to the right of the decimal) is sound and gives you a digit to use for consistent rounding. Otherwise $10.0129 and $10.0121 would be equal but their sum would be $20.025 rather than $20.024 ... accountants like things that "foot".
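The epsilon comparison discussed in this thread can be sketched as follows. The thread is about C#, but the same arithmetic applies to Java's double; the EPSILON value is the .00001 from the question, and the class and method names are hypothetical.

```java
class MoneyCompare {
    // Assumption: values have at most 3 decimal places, as in the question.
    static final double EPSILON = 0.00001;

    // True when a and b differ by less than EPSILON.
    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    // True when a is within EPSILON of zero.
    static boolean isEffectivelyZero(double a) {
        return nearlyEqual(a, 0.0);
    }

    public static void main(String[] args) {
        double sum = 1.1 + 2.2;
        System.out.println(sum == 3.3);                      // false: rounding error
        System.out.println(nearlyEqual(sum, 3.3));           // true
        System.out.println(nearlyEqual(10.001, 10.002));     // false: a real 0.001 difference
        System.out.println(isEffectivelyZero(sum - 3.3));    // true
    }
}
```

A fixed epsilon like this only works while the values stay small enough that the spacing between adjacent doubles is far below EPSILON, which is the range argument made in the question.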
I'm just curious: why, in IEEE-754, does any non-zero float number divided by zero result in an infinite value? It's nonsense from the mathematical perspective. So I think that the correct result for this operation is NaN.
The function f(x) = 1/x is not defined when x=0, if x is a real number. For comparison, the function sqrt is not defined for any negative number, and sqrt(-1.0f) in IEEE-754 produces a NaN value. But 1.0f/0 is Inf.
So division by zero is not treated as NaN in IEEE-754. There must be a reason for this, maybe some optimization or compatibility reasons.
So what's the point?
It's a nonsense from the mathematical perspective.
Yes. No. Sort of.
The thing is: Floating-point numbers are approximations. You want to use a wide range of exponents and a limited number of digits and get results which are not completely wrong. :)
The idea behind IEEE-754 is that every operation could trigger "traps" which indicate possible problems. They are:
Illegal (senseless operation like sqrt of negative number)
Overflow (too big)
Underflow (too small)
Division by zero (The thing you do not like)
Inexact (This operation may give you wrong results because you are losing precision)
Now many people, like scientists and engineers, do not want to be bothered with writing trap routines. So Kahan, the primary architect of IEEE-754, decided that every operation should also return a sensible default value if no trap routines exist.
They are:
NaN for illegal values
signed infinities for Overflow
signed zeroes for Underflow
NaN for indeterminate results (0/0) and infinities for x/0 where x != 0
normal operation result for Inexact
The thing is that in 99% of all cases zeroes are caused by underflow, and therefore 99% of the time Infinity is "correct" even if wrong from a mathematical perspective.
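Java does not expose the IEEE-754 traps, so it always produces the default values, which makes it a convenient place to look at them. The sketch below covers the invalid, overflow, underflow and inexact cases; the class and helper names are invented.

```java
class Ieee754Defaults {
    // -0.0 == 0.0 is true, so compare raw bits to tell the two zeroes apart.
    static boolean isNegativeZero(double d) {
        return Double.doubleToRawLongBits(d) == Double.doubleToRawLongBits(-0.0);
    }

    public static void main(String[] args) {
        System.out.println(Math.sqrt(-1.0));        // invalid operation -> NaN
        System.out.println(Double.MAX_VALUE * 2);   // overflow -> Infinity
        System.out.println(-Double.MAX_VALUE * 2);  // overflow -> -Infinity
        System.out.println(Double.MIN_VALUE / 2);   // underflow -> 0.0
        System.out.println(-Double.MIN_VALUE / 2);  // underflow -> -0.0
        System.out.println(1.0 / 3.0);              // inexact -> the nearest double
    }
}
```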
I'm not sure why you would believe this to be nonsense.
The simplistic definition of a / b, at least for non-zero b, is the unique number of bs that has to be subtracted from a before you get to zero.
Expanding that to the case where b can be zero, the number that has to be subtracted from any non-zero number to get to zero is indeed infinite, because you'll never get to zero.
Another way to look at it is to talk in terms of limits. As a positive number n approaches zero, the expression 1 / n approaches "infinity". You'll notice I've quoted that word because I'm a firm believer in not propagating the delusion that infinity is actually a concrete number :-)
NaN is reserved for situations where the number cannot be represented (even approximately) by any other value (including the infinities); it is considered distinct from all those other values.
For example, 0 / 0 (using our simplistic definition above) can have any amount of bs subtracted from a to reach 0. Hence the result is indeterminate - it could be 1, 7, 42, 3.14159 or any other value.
Similarly things like the square root of a negative number, which has no value in the real plane used by IEEE754 (you have to go to the complex plane for that), cannot be represented.
In mathematics, division by zero is undefined because zero has no sign, therefore two results are equally possible, and exclusive: negative infinity or positive infinity (but not both).
In (most) computing, 0.0 has a sign. Therefore we know what direction we are approaching from, and what sign infinity would have. This is especially true when 0.0 represents a non-zero value too small to be expressed by the system, as is frequently the case.
The only time NaN would be appropriate is if the system knows with certainty that the denominator is truly, exactly zero. And it can't unless there is a special way to designate that, which would add overhead.
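The point about a signed zero standing in for a value too small to express can be illustrated in Java. In this sketch the product 1e-200 * 1e-200 is far below the smallest double, so it underflows to a zero that keeps the sign of the true result; the helper name is made up.

```java
class SignedZeroDemo {
    // Computes 1 / (a * b); when a * b underflows to a signed zero,
    // the result is an infinity carrying that sign.
    static double reciprocalOfProduct(double a, double b) {
        return 1.0 / (a * b);
    }

    public static void main(String[] args) {
        System.out.println(1e-200 * 1e-200);                       // 0.0
        System.out.println(-1e-200 * 1e-200);                      // -0.0
        System.out.println(reciprocalOfProduct(1e-200, 1e-200));   // Infinity
        System.out.println(reciprocalOfProduct(-1e-200, 1e-200));  // -Infinity
    }
}
```

The true reciprocals are huge positive and negative numbers, so the signed infinities are the closest available answers.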
NOTE:
I re-wrote this following a valuable comment from @Cubic.
I think the correct answer to this has to come from calculus and the notion of limits. Consider the limit of f(x)/g(x) as x->0 under the assumption that g(0) == 0. There are two broad cases that are interesting here:
If f(0) != 0, then the limit as x->0 is either plus or minus infinity, or it's undefined. If g(x) takes both signs in the neighborhood of x==0, then the limit is undefined (left and right limits don't agree). If g(x) has only one sign near 0, however, the limit will be defined and be either positive or negative infinity. More on this later.
If f(0) == 0 as well, then the limit can be anything, including positive infinity, negative infinity, a finite number, or undefined.
In the second case, generally speaking, you cannot say anything at all. Arguably, in the second case NaN is the only viable answer.
Now in the first case, why choose one particular sign when either is possible or it might be undefined? As a practical matter, it gives you more flexibility in cases where you do know something about the sign of the denominator, at relatively little cost in the cases where you don't. You may have a formula, for example, where you know analytically that g(x) >= 0 for all x, say, for example, g(x) = x*x. In that case the limit is defined and it's infinity with sign equal to the sign of f(0). You might want to take advantage of that as a convenience in your code. In other cases, where you don't know anything about the sign of g, you cannot generally take advantage of it, but the cost here is just that you need to trap for a few extra cases - positive and negative infinity - in addition to NaN if you want to fully error check your code. There is some price there, but it's not large compared to the flexibility gained in other cases.
Why worry about general functions when the question was about "simple division"? One common reason is that if you're computing your numerator and denominator through other arithmetic operations, you accumulate round-off errors. The presence of those errors can be abstracted into the general formula format shown above. For example f(x) = x + e, where x is the analytically correct, exact answer, e represents the error from round-off, and f(x) is the floating point number that you actually have on the machine at execution.
All experienced programmers in C# (I think this comes from C) are used to casting one of the integers in a division to get the decimal / double / float result instead of the int (the real result truncated).
I'd like to know why this is implemented like this. Is there ANY good reason to truncate the result if both numbers are integer?
C# traces its heritage to C, so the answer to "why is it like this in C#?" is a combination of "why is it like this in C?" and "was there no good reason to change?"
The approach of C is to have a fairly close correspondence between the high-level language and low-level operations. Processors generally implement integer division as returning a quotient and a remainder, both of which are of the same type as the operands.
(So my question would be, "why doesn't integer division in C-like languages return two integers", not "why doesn't it return a floating point value?")
The solution was to provide separate operations for division and remainder, each of which returns an integer. In the context of C, it's not surprising that the result of each of these operations is an integer. This is frequently more accurate than floating-point arithmetic. Consider the example from your comment of 7 / 3. This value cannot be represented by a finite binary number nor by a finite decimal number. In other words, on today's computers, we cannot accurately represent 7 / 3 unless we use integers! The most accurate representation of this fraction is "quotient 2, remainder 1".
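The "quotient 2, remainder 1" example can be written out as below. Java is used here because its / and % on ints behave like the C# operators discussed in this thread; divMod is a hypothetical helper, not a library method.

```java
class QuotientAndRemainder {
    // Returns {quotient, remainder}; both are ints, and
    // quotient * b + remainder == a always holds (for b != 0).
    static int[] divMod(int a, int b) {
        return new int[] { a / b, a % b };
    }

    public static void main(String[] args) {
        int[] qr = divMod(7, 3);
        System.out.println(qr[0] + " remainder " + qr[1]);  // 2 remainder 1
        System.out.println(qr[0] * 3 + qr[1]);              // 7: nothing was lost
        System.out.println((double) 7 / 3);                 // only an approximation of 7/3
    }
}
```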
So, was there no good reason to change? I can't think of any, and I can think of a few good reasons not to change. None of the other answers has mentioned Visual Basic which (at least through version 6) has two operators for dividing integers: / converts the integers to double, and returns a double, while \ performs normal integer arithmetic.
I learned about the \ operator after struggling to implement a binary search algorithm using floating-point division. It was really painful, and integer division came in like a breath of fresh air. Without it, there was lots of special handling to cover edge cases and off-by-one errors in the first draft of the procedure.
From that experience, I draw the conclusion that having different operators for dividing integers is confusing.
Another alternative would be to have only one integer operation, which always returns a double, and require programmers to truncate it. This means you have to perform two int->double conversions, a truncation and a double->int conversion every time you want integer division. And how many programmers would mistakenly round or floor the result instead of truncating it? It's a more complicated system, and at least as prone to programmer error, and slower.
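The round-or-floor mistake mentioned above is easy to reproduce. This Java sketch compares native integer division with three ways of going through double; the method names are invented. The variants disagree for negative operands and for results ending in .5.

```java
class TruncateVsFloor {
    // Casting a double to long truncates toward zero, like int / int.
    static long viaTruncate(int a, int b) { return (long) ((double) a / b); }

    // Math.floor rounds toward negative infinity.
    static long viaFloor(int a, int b) { return (long) Math.floor((double) a / b); }

    // Math.round rounds to the nearest long, with ties going up.
    static long viaRound(int a, int b) { return Math.round((double) a / b); }

    public static void main(String[] args) {
        System.out.println(-7 / 2);              // -3
        System.out.println(viaTruncate(-7, 2));  // -3, matches integer division
        System.out.println(viaFloor(-7, 2));     // -4, does not match
        System.out.println(7 / 2);               // 3
        System.out.println(viaRound(7, 2));      // 4, does not match
    }
}
```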
Finally, in addition to binary search, there are many standard algorithms that employ integer arithmetic. One example is dividing collections of objects into sub-collections of similar size. Another is converting between indices in a 1-d array and coordinates in a 2-d matrix.
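As an illustration of the index conversion mentioned above, here is a minimal Java sketch assuming row-major storage; the class and method names are hypothetical.

```java
class GridIndex {
    // Row of the element stored at position index in a row-major array.
    static int row(int index, int width) { return index / width; }

    // Column of that element.
    static int col(int index, int width) { return index % width; }

    // Inverse mapping from (row, col) back to the 1-d index.
    static int index(int row, int col, int width) { return row * width + col; }

    public static void main(String[] args) {
        int width = 3;
        System.out.println(row(7, width) + ", " + col(7, width));  // 2, 1
        System.out.println(index(2, 1, width));                    // 7
    }
}
```

With a division that returned a double, every call to row would need an explicit truncation back to int.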
As far as I can see, no alternative to "int / int yields int" survives a cost-benefit analysis in terms of language usability, so there's no reason to change the behavior inherited from C.
In conclusion:
Integer division is frequently useful in many standard algorithms.
When the floating-point division of integers is needed, it may be invoked explicitly with a simple, short, and clear cast: (double)a / b rather than a / b
Other alternatives introduce more complication for the programmer and more clock cycles for the processor.
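One detail of the cast is worth a sketch: it has to be applied to an operand, not to the finished quotient. The Java below behaves the same way as the C# expression (double)a / b; the method names are invented.

```java
class CastPlacement {
    // The cast binds to a, so the division is done in double.
    static double castOperand(int a, int b) { return (double) a / b; }

    // The integer division happens first; the cast only converts its result.
    static double castResult(int a, int b) { return (double) (a / b); }

    public static void main(String[] args) {
        System.out.println(1 / 2);              // 0
        System.out.println(castOperand(1, 2));  // 0.5
        System.out.println(castResult(1, 2));   // 0.0
    }
}
```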
Is there ANY good reason to truncate the result if both numbers are integer?
Of course; I can think of a dozen such scenarios easily. For example: you have a large image, and a thumbnail version of the image which is 10 times smaller in both dimensions. When the user clicks on a point in the large image, you wish to identify the corresponding pixel in the scaled-down image. Clearly to do so, you divide both the x and y coordinates by 10. Why would you want to get a result in decimal? The corresponding coordinates are going to be integer coordinates in the thumbnail bitmap.
Doubles are great for physics calculations and decimals are great for financial calculations, but almost all the work I do with computers that does any math at all does it entirely in integers. I don't want to be constantly having to convert doubles or decimals back to integers just because I did some division. If you are solving physics or financial problems then why are you using integers in the first place? Use nothing but doubles or decimals. Use integers to solve finite mathematics problems.
Calculating on integers is faster (usually) than on floating point values. Besides, all other integer/integer operations (+, -, *) return an integer.
EDIT:
As per the request of the OP, here are some additions:
The OP's problem is that they think of / as division in the mathematical sense, while the / operator in the language performs some other operation (which is not mathematical division). By this logic they should question the validity of all other operations (+, -, *) as well, since those have special overflow rules, which are not what would be expected from their mathematical counterparts. If this is bothersome for someone, they should find another language where the operations perform as expected by the person.
As for the claim on performance difference in favor of integer values: When I wrote the answer I only had "folk" knowledge and "intuition" to back up the claim (hence my "usually" disclaimer). Indeed as Gabe pointed out, there are platforms where this does not hold. On the other hand I found this link (point 12) that shows mixed performances on an Intel platform (the language used is Java, though).
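The overflow rules mentioned above look like this in Java, where int arithmetic wraps silently by default and Math.addExact is the standard-library way to ask for an exception instead. The helper name addOverflows is made up.

```java
class IntOverflowDemo {
    // True when a + b does not fit in an int.
    static boolean addOverflows(int a, int b) {
        try {
            Math.addExact(a, b);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.MAX_VALUE + 1);                // wraps to Integer.MIN_VALUE
        System.out.println(addOverflows(Integer.MAX_VALUE, 1));   // true
        System.out.println(addOverflows(1, 2));                   // false
    }
}
```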
The takeaway should be that with performance many claims and intuition are unsubstantiated until measured and found true.
Yes, if the end result needs to be a whole number. It would depend on the requirements.
If these are indeed your requirements, then you would not want to store a decimal and then truncate it. You would be wasting memory and processing time to accomplish something that is already built-in functionality.
The operator is designed to return the same type as its input.
Edit (comment response):
Why? I don't design languages, but I would assume most of the time you will be sticking with the data types you started with and in the remaining instance, what criteria would you use to automatically assume which type the user wants? Would you automatically expect a string when you need it? (sincerity intended)
If you add an int to an int, you expect to get an int. If you subtract an int from an int, you expect to get an int. If you multiply an int by an int, you expect to get an int. So why would you not expect an int result if you divide an int by an int? And if you expect an int, then you will have to truncate.
If you don't want that, then you need to cast your ints to something else first.
Edit: I'd also note that if you really want to understand why this is, then you should start looking into how binary math works and how it is implemented in an electronic circuit. It's certainly not necessary to understand it in detail, but having a quick overview of it would really help you understand how the low-level details of the hardware filter through to the details of high-level languages.
All the methods in System.Math take double as parameters and return double. The constants are also of type double. I checked out MathNet.Numerics, and the same seems to be the case there.
Why is this? Especially for constants. Isn't decimal supposed to be more exact? Wouldn't that often be kind of useful when doing calculations?
This is a classic speed-versus-accuracy trade off.
However, keep in mind that for PI, for example, the most digits you will ever need is 41.
The largest number of digits of pi that you will ever need is 41. To compute the circumference of the universe with an error less than the diameter of a proton, you need 41 digits of pi. It seems safe to conclude that 41 digits is sufficient accuracy in pi for any circle measurement problem you're likely to encounter. Thus, in the over one trillion digits of pi computed in 2002, all digits beyond the 41st have no practical value.
In addition, decimal and double have a slightly different internal storage structure. Decimals are designed to store base 10 data, whereas doubles (and floats) are made to hold binary data. On a binary machine (like every computer in existence) a double will have fewer wasted bits when storing any number within its range.
Also consider:
Type            Size      Approximate range        Precision
System.Double   8 bytes   ±5.0e-324 to ±1.7e308    15 or 16 significant figures
System.Decimal  16 bytes  ±1.0e-28 to ±7.9e28      28 or 29 significant figures
As you can see, decimal has a smaller range, but a higher precision.
No - decimals are no more "exact" than doubles, or for that matter, any type. The concept of "exactness" (when speaking about numerical representations in a computer) is what is wrong. Any type is absolutely 100% exact at representing some numbers. Unsigned bytes are 100% exact at representing the whole numbers from 0 to 255, but they're no good for fractions or for negatives or integers outside the range.
Decimals are 100% exact at representing a certain set of base 10 values. Doubles (since they store their value using binary IEEE exponential representation) are exact at representing a set of binary numbers.
Neither is any more exact than the other in general; they are simply for different purposes.
To elaborate a bit further, since I seem to not be clear enough for some readers...
If you take every number which is representable as a decimal, and mark every one of them on a number line, between every adjacent pair of them there is an additional infinity of real numbers which are not representable as a decimal. The exact same statement can be made about the numbers which can be represented as a double. If you marked every decimal on the number line in blue, and every double in red, except for the integers, there would be very few places where the same value was marked in both colors.
In general, for 99.99999 % of the marks, (please don't nitpick my percentage) the blue set (decimals) is a completely different set of numbers from the red set (the doubles).
This is because the blue set is, by definition, a base 10 mantissa/exponent representation, and a double is a base 2 mantissa/exponent representation. A value represented as a base 2 mantissa and exponent, such as 1.00110101001 x 2^(-11101001101001), means: take the mantissa value (1.00110101001) and multiply it by 2 raised to the power of the exponent (when the exponent is negative this is equivalent to dividing by 2 to the power of the absolute value of the exponent). Where the exponent is negative (or where any portion of the mantissa is a binary fraction), writing the value out exactly in base 10 generally takes far more digits than a decimal mantissa holds, so it cannot be stored exactly as a decimal. In the other direction, most decimal fractions, such as 0.1, have no finite base 2 expansion at all.
For any arbitrary real number, that falls randomly on the real number line, it will either be closer to one of the blue decimals, or to one of the red doubles.
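The two sets of representable values can be compared in Java. Note the assumption: Java's BigDecimal is an arbitrary-precision base 10 type, used here only as a stand-in for C#'s fixed-size decimal, so it illustrates the base 10 versus base 2 point but not decimal's 28 to 29 digit limit. The class and method names are invented.

```java
import java.math.BigDecimal;

class DecimalVsBinary {
    // The exact value stored in a double, written out in base 10.
    static String exactValueOf(double d) {
        return new BigDecimal(d).toPlainString();
    }

    // 0.1 + 0.2 compared with 0.3, done entirely in base 10.
    static boolean decimalSumIsExact() {
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        return sum.compareTo(new BigDecimal("0.3")) == 0;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);      // false in base 2
        System.out.println(decimalSumIsExact());   // true in base 10
        System.out.println(exactValueOf(0.5));     // 0.5: a value in both sets
        System.out.println(exactValueOf(0.1));     // a long expansion, not 0.1
    }
}
```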
Decimal is more precise but has less of a range. You would generally use Double for physics and mathematical calculations but you would use Decimal for financial and monetary calculations.
See the following articles on msdn for details.
Double
http://msdn.microsoft.com/en-us/library/678hzkk9.aspx
Decimal
http://msdn.microsoft.com/en-us/library/364x0z75.aspx
Seems like most of the arguments here to "It does not do what I want" are "but it's faster", well so is ANSI C+Gmp library, but nobody is advocating that right?
If you particularly want to control accuracy, then there are other languages which have taken the time to implement exact precision, in a user controllable way:
http://www.doughellmann.com/PyMOTW/decimal/
If precision is really important to you, then you are probably better off using languages that mathematicians would use. If you do not like Fortran then Python is a modern alternative.
Whatever language you are working in, remember the golden rule:
Avoid mixing types...
So do convert a and b to be the same before you attempt a operator b
If I were to hazard a guess, I'd say those functions leverage low-level math functionality (perhaps in C) that does not use decimals internally, and so returning a decimal would require a cast from double to decimal anyway. Besides, the purpose of the decimal value type is to ensure accuracy; these functions do not and cannot return 100% accurate results without infinite precision (e.g., irrational numbers).
Neither Decimal nor float or double are good enough if you require something to be precise. Furthermore, Decimal is so expensive and overused out there it is becoming a regular joke.
If you work in fractions and require ultimate precision, use fractions. It's same old rule, convert once and only when necessary. Your rounding rules too will vary per app, domain and so on, but sure you can find an odd example or two where it is suitable. But again, if you want fractions and ultimate precision, the answer is not to use anything but fractions. Consider you might want a feature of arbitrary precision as well.
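A minimal sketch of what "use fractions" could look like, in Java with BigInteger for arbitrary precision. This Fraction class is a hypothetical illustration written for this thread, not a library type, and it only implements addition and multiplication.

```java
import java.math.BigInteger;

final class Fraction {
    final BigInteger num;
    final BigInteger den;

    Fraction(long num, long den) {
        this(BigInteger.valueOf(num), BigInteger.valueOf(den));
    }

    Fraction(BigInteger n, BigInteger d) {
        if (d.signum() == 0) throw new ArithmeticException("zero denominator");
        if (d.signum() < 0) { n = n.negate(); d = d.negate(); }  // keep the sign on top
        BigInteger g = n.gcd(d);                                 // reduce to lowest terms
        this.num = n.divide(g);
        this.den = d.divide(g);
    }

    Fraction plus(Fraction o) {
        return new Fraction(num.multiply(o.den).add(o.num.multiply(den)), den.multiply(o.den));
    }

    Fraction times(Fraction o) {
        return new Fraction(num.multiply(o.num), den.multiply(o.den));
    }

    @Override
    public String toString() { return num + "/" + den; }

    public static void main(String[] args) {
        System.out.println(new Fraction(1, 3).plus(new Fraction(1, 6)));   // 1/2
        System.out.println(new Fraction(1, 3).times(new Fraction(3, 1)));  // 1/1
    }
}
```

Results stay exact for as long as the values remain rational; converting to double or decimal happens once, at the end, if at all.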
The actual problem with CLR in general is that it is so odd and plain broken to implement a library that deals with numerics in generic fashion largely due to bad primitive design and shortcoming of the most popular compiler for the platform. It's almost the same as with Java fiasco.
double just turns out to be the best compromise covering most domains, and it works well, despite the fact MS JIT is still incapable of utilising a CPU tech that is about 15 years old now.
[piece to users of MSDN slowdown compilers]
Double is a built-in type. It is supported by the FPU/SSE core (formerly known as the "math coprocessor"), which is why it is blazingly fast, especially at multiplication and scientific functions.
Decimal is actually a complex structure, consisting of several integers.
So, we know that fractions such as 0.1 cannot be accurately represented in binary, which causes precision problems (such as those mentioned here: Formatting doubles for output in C#).
And we know we have the decimal type for a decimal representation of numbers... but the problem is, a lot of Math methods do not support the decimal type, so we have to convert them to double, which ruins the number again.
so what should we do?
Oh, what should we do about the fact that most decimal fractions cannot be represented in binary? Or for that matter, that binary fractions cannot be represented in decimal?
Or, even, that an infinity (in fact, a non-countable infinity) of real numbers in all bases cannot be accurately represented in any computerized system??
Nothing! To recall an old cliche, you can get close enough for government work... In fact, you can get close enough for any work... There is no limit to the degree of accuracy the computer can generate; it just cannot be infinite (which is what would be required for a number representation scheme to be able to represent every possible real number).
You see, for every number representation scheme you can design, in any computer, it can only represent a finite number of distinct different real numbers with 100.00 % accuracy. And between each adjacent pair of those numbers (those that can be represented with 100% accuracy), there will always be an infinity of other numbers that it cannot represent with 100% accuracy.
so what should we do?
We just keep on breathing. It really isn't a structural problem. We have a limited precision but usually more than enough. You just have to remember to format/round when presenting the numbers.
The problem in the following snippet is with the WriteLine(), not in the calculation(s):
double x = 6.9 - 10 * 0.69;
Console.WriteLine("x = {0}", x);
If you have a specific problem, then post it. There usually are ways to prevent loss of precision. If you really need >= 30 decimal digits, you need a special library.
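The same calculation as in the C# snippet above, rendered in Java as a sketch (the class and method names are invented). The computed x is a tiny positive rounding residue rather than zero, and rounding at the point of output is what hides it.

```java
import java.util.Locale;

class RoundingOnOutput {
    // Mathematically zero, but each operand is rounded to a double first.
    static double compute() {
        return 6.9 - 10 * 0.69;
    }

    // Formats with two decimal places; Locale.ROOT keeps the '.' separator.
    static String format(double x) {
        return String.format(Locale.ROOT, "%.2f", x);
    }

    public static void main(String[] args) {
        double x = compute();
        System.out.println("x = " + x);          // a value on the order of 1e-16
        System.out.println("x = " + format(x));  // x = 0.00
    }
}
```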
Keep in mind that the precision you need, and the rounding rules required, will depend on your problem domain.
If you are writing software to control a nuclear reactor, or to model the first billionth of a second of the universe after the big bang (my friend actually did that), you will need much higher precision than if you are calculating sales tax (something I do for a living).
In the finance world, for example, there will be specific requirements on precision either implicitly or explicitly. Some US taxing jurisdictions specify tax rates to 5 digits after the decimal place. Your rounding scheme needs to allow for that much precision. When much of Western Europe converted to the Euro, there was a very specific approach to rounding that was written into law. During that transition period, it was essential to round exactly as required.
Know the rules of your domain, and test that your rounding scheme satisfies those rules.
I think everyone is implying:
Inverting a sparse matrix? "There's an app for that", etc, etc
Numerical computation is one well-flogged horse. If you have a problem, it was probably put to pasture before 1970 or even much earlier, carried forward library by library or snippet by snippet into the future.
You could shift the decimal point so that the numbers are whole, then do 64 bit integer arithmetic, then shift it back. Then you would only have to worry about overflow problems.
And we know we have the decimal type for a decimal representation of numbers... but the problem is, a lot of Math methods do not support the decimal type, so we have to convert them to double, which ruins the number again.
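A sketch of the shift-to-whole-numbers idea in Java, assuming three decimal places (so amounts are held as thousandths in a long). The class and method names are hypothetical; Math.addExact is the standard-library call that turns the remaining overflow risk into an exception.

```java
import java.math.BigDecimal;

class FixedPointMoney {
    static final int SCALE = 3;  // assumption: three digits after the decimal point

    // Parses "1.1" into 1100 thousandths; throws if more than SCALE decimals are given.
    static long toThousandths(String amount) {
        return new BigDecimal(amount).movePointRight(SCALE).longValueExact();
    }

    // Shifts the decimal point back for display.
    static String fromThousandths(long value) {
        return BigDecimal.valueOf(value, SCALE).toPlainString();
    }

    // Exact 64-bit integer addition; overflow raises ArithmeticException.
    static long add(long a, long b) {
        return Math.addExact(a, b);
    }

    public static void main(String[] args) {
        long sum = add(toThousandths("1.1"), toThousandths("2.2"));
        System.out.println(sum);                   // 3300
        System.out.println(fromThousandths(sum));  // 3.300
    }
}
```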
Several of the Math methods do support decimal: Abs, Ceiling, Floor, Max, Min, Round, Sign, and Truncate. What these functions have in common is that they return exact results. This is consistent with the purpose of decimal: To do exact arithmetic with base-10 numbers.
The trig and Exp/Log/Pow functions return approximate answers, so what would be the point of having overloads for an "exact" arithmetic type?