Extended precision floating point dangers in C# [duplicate] - c#

This question already has answers here:
Is floating-point math consistent in C#? Can it be?
(10 answers)
Closed 7 years ago.
I am writing a library for multiprecision arithmetic based on a paper I am reading. It is very important that I am able to guarantee the properties of floating point numbers I use. In particular, that they adhere to the IEEE 754 standard for double precision floating point numbers. Clearly I cannot guarantee the behavior of my code on an unexpected platform, but for x86 and x64 chipsets, which I am writing for, I am concerned about a particular hazard. Apparently, some or all x86 / x64 chipsets may make use of extended precision floating point numbers in their FPU registers, with 80 bits of precision. I cannot tolerate my arithmetic being handled in extended precision FPUs without being rounded to double precision after every operation because the proofs of correctness for the algorithms I am using rely on rounding to occur. I can easily identify cases in which extended precision could break these algorithms.
I am writing my code in C#. How can I guarantee certain values are rounded? In C, I would declare variables as volatile, forcing them to be written back to RAM. This is slow and I'd rather keep the numbers in registers as 64-bit floats, but correctness in these algorithms is the whole point, not speed. In any case, I need a solution for C#. If this seems infeasible I will approach the problem in a different language.

The C# spec has this to say on the topic:
Only at excessive cost in performance can such hardware architectures be made to perform floating-point operations with less precision, and rather than require an implementation to forfeit both performance and precision, C# allows a higher precision type to be used for all floating-point operations. Other than delivering more precise results, this rarely has any measurable effects.
As a result, third-party libraries are required to simulate the behavior of an IEEE 754-compliant FPU. One such library is SoftFloat, which defines a SoftFloat type that uses operator overloads to simulate standard double behavior.

An obvious problem with 80-bit intermediate values is that it is very much up to the compiler and optimizer to decide when a value is truncated back to 64-bit. So different compilers may end up producing different results for the same sequence of floating point operations. An example is an operation like a*b*c*d. Depending on the availability of 80-bit floating point registers the compiler might round a*b to 64-bit and leave c*d at 80-bit. I guess this is the root of your question where you need to eliminate this uncertainty.
I think your options are pretty limited in managed code. You could use a 3rd party software emulation like the other answer suggested. Or maybe you could try coercing the double to long and back. I have no way of checking if this actually works right now but you could try something like this between operations:
public static double Truncate64(double val)
{
    unsafe
    {
        long l = *((long*) &val);
        return *((double*) &l);
    }
}
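This also type checks, but note that it does not do the same thing: the dereferenced long is implicitly converted to double by value, so the method returns the bit pattern as a number instead of the original value:
public static double Truncate64(double val)
{
    unsafe
    {
        return *((long*) &val);
    }
}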
Hope that helps.
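One more option worth testing, offered as a hedged sketch and not a guarantee: as far as I understand the CLI specification (ECMA-335), an explicit conversion to double (the conv.r8 instruction) narrows a value that is held in a wider internal representation, and the C# compiler is generally reported to keep an explicit (double) cast even when the operand is already a double. Assuming that holds on your runtime, rounding after every operation could look like the following. RoundedOps and TwoSum are illustrative names that do not come from the question; TwoSum is simply a well-known error-free transformation whose correctness depends on each operation being rounded to double.

```csharp
public static class RoundedOps
{
    // Assumption: the explicit (double) cast forces the result of each
    // operation to be rounded to 64-bit double precision before it is reused.
    public static double Add(double a, double b) => (double)(a + b);

    public static double Sub(double a, double b) => (double)(a - b);

    // Knuth-style TwoSum: returns the rounded sum together with the
    // rounding error, so that a + b == Sum + Error holds exactly.
    public static (double Sum, double Error) TwoSum(double a, double b)
    {
        double s = Add(a, b);
        double bVirtual = Sub(s, a);           // the part of b that made it into s
        double aVirtual = Sub(s, bVirtual);    // the part of a that made it into s
        double error = Add(Sub(a, aVirtual), Sub(b, bVirtual));
        return (s, error);
    }
}
```

Calling RoundedOps.TwoSum(1.0, 1e-17) should give a sum of 1.0 and carry the lost 1e-17 in the error term. If the error term comes back wrong on your target, per-operation rounding is not happening there.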

Related

Most efficient way of multiplying and dividing fixed scale decimal numbers

Background
I work in the field of financial trading and am currently optimizing a real-time C# trading application.
Through extensive profiling I have identified that the performance of System.Decimal is now a bottleneck. As a result I am currently coding up a couple of more efficient fixed scale 64-bit 'decimal' structures (one signed, one unsigned) to perform base-10 arithmetic. Using a fixed scale of 9 (i.e. 9 digits after the decimal point) means the underlying 64-bit integer can be used to represent the values:
-9,223,372,036.854775808 to 9,223,372,036.854775807
and
0 to 18,446,744,073.709551615
respectively.
This makes most operations trivial (i.e. comparisons, addition, subtraction). However, for multiplication and division I am currently falling back on the implementation provided by System.Decimal. I assume the external FCallMultiply method it invokes for multiplication uses either the Karatsuba or Toom–Cook algorithm under the covers. For division, I'm not sure which particular algorithm it would use.
Question
Does anyone know if, due to the fixed scale of my decimal values, there are any faster multiplication and division algorithms I can employ which are likely to out-perform System.Decimal?
I would appreciate your thoughts...
I have done something similar, by using the Schönhage Strassen algorithm.
I cannot find any sources now, but you can try to convert this code to the C# language.
P.S. I cannot say for sure about System.Decimal, but the "Karatsuba algorithm" is used by System.Numerics.BigInteger
My take on fixed point arithmetic in general, not knowing about C# or .NET in particular (VS Express is acting up); see also "Fixed point math in c#?" and "Why no fixed point type in C#?":
The main point is a fixed scale, and that this is conceptual first and foremost: the hardware couldn't care less about the meaning or interpretation of numbers (unless it supports something, if only for marketing reasons)
the easy: addition/subtraction - just ignore scaling
multiplication: compute the double-wide product, divide by scale
division: multiply (widened) dividend by scale and divide
the ugly - transcendental functions beyond exponentiation (exponentiate, multiply by scale to half that power)
in choosing a scale, don't forget conversion to and from digits, which may vastly outnumber multiplication&division (and give using a square a thought, see above …)
That said, "multiples of word size" and powers of two have been popular choices for scale due to speed in multiplying and dividing by such a scale. This still may make a difference with contemporary processors, if not for main ALUs of PCs - think SIMD extensions, GPUs, embedded …
Given what little I was able to discern of your application and requirements (consider disclosing more), three generic choices to consider are 10^9 (to the 9th power), 2^30 and 2^32. The latter representations may be called 34.30 and 32.32 for the bit lengths of their integral and fractional parts, respectively.
With a language that allows creating types (especially one supporting operators in addition to invokable procedures), I deem it important to design and implement that new type according to the principle of least surprise.
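To make the multiplication and division rules above concrete, here is a minimal sketch for a signed value with scale 10^9. Fixed9, Raw and Scale are made-up names, and System.Numerics.BigInteger is used for the double-wide intermediate purely for clarity. It is certainly not the fast path the question is after; a tuned version would use a 128-bit intermediate instead. Results are truncated toward zero, which is an assumption about the rounding you want.

```csharp
using System.Numerics;

public struct Fixed9
{
    // Hypothetical fixed scale: 9 digits after the decimal point.
    public const long Scale = 1000000000L;

    // The underlying scaled 64-bit integer.
    public readonly long Raw;

    public Fixed9(long raw) { Raw = raw; }

    public static Fixed9 FromDecimal(decimal d) => new Fixed9((long)(d * Scale));

    public decimal ToDecimal() => (decimal)Raw / Scale;

    // Addition and subtraction: just ignore the scaling.
    public static Fixed9 operator +(Fixed9 x, Fixed9 y) => new Fixed9(x.Raw + y.Raw);
    public static Fixed9 operator -(Fixed9 x, Fixed9 y) => new Fixed9(x.Raw - y.Raw);

    // Multiplication: compute the double-wide product, then divide by the scale.
    public static Fixed9 operator *(Fixed9 x, Fixed9 y) =>
        new Fixed9((long)((BigInteger)x.Raw * y.Raw / Scale));

    // Division: multiply the widened dividend by the scale, then divide.
    public static Fixed9 operator /(Fixed9 x, Fixed9 y) =>
        new Fixed9((long)((BigInteger)x.Raw * Scale / y.Raw));
}
```

With this, Fixed9.FromDecimal(3000m) * Fixed9.FromDecimal(3000m) works even though the raw product (9 * 10^24) does not fit in 64 bits, because only the final result has to fit. The conversion back to long throws an OverflowException when it does not.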

Floating point inconsistency between expression and assigned object

This surprised me - the same arithmetic gives different results depending on how it's executed:
> 0.1f+0.2f==0.3f
False
> var z = 0.3f;
> 0.1f+0.2f==z
True
> 0.1f+0.2f==(dynamic)0.3f
True
(Tested in Linqpad)
What's going on?
Edit: I understand why floating point arithmetic is imprecise, but not why it would be inconsistent.
The venerable C reliably confirms that 0.1 + 0.2 == 0.3 holds for single-precision floats, but not double-precision floating points.
I strongly suspect you may find that you get different results running this code with and without the debugger, and in release configuration vs in debug configuration.
In the first version, you're comparing two expressions. The C# language allows those expressions to be evaluated in higher precision arithmetic than the source types.
In the second version, you're assigning the addition result to a local variable. In some scenarios, that will force the result to be truncated down to 32 bits - leading to a different result. In other scenarios, the CLR or C# compiler will realize that it can optimize away the local variable.
From section 4.1.6 of the C# 4 spec:
Floating point operations may be performed with higher precision than the result type of the operation. For example, some hardware architectures support an "extended" or "long double" floating point type with greater range and precision than the double type, and implicitly perform all floating point operations with the higher precision type. Only at excessive cost in performance can such hardware architectures be made to perform floating point operations with less precision. Rather than require an implementation to forfeit both performance and precision, C# allows a higher precision type to be used for all floating point operations. Other than delivering more precise results, this rarely has any measurable effects.
EDIT: I haven't tried compiling this, but in the comments, Chris says the first form isn't being evaluated at execution time at all. The above can still apply (I've tweaked my wording slightly) - it's just shifted the evaluation time of a constant from execution time to compile-time. So long as it behaves the same way as a valid evaluation, that seems okay to me - so the compiler's own constant expression evaluation can use higher-precision arithmetic too.
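To see the effect of precision in isolation, here is a small sketch that pins the precision explicitly instead of leaving it to the compiler or JIT. FloatPrecisionDemo and its method names are invented for illustration. The (float) cast is there to force rounding to 32 bits; on a runtime that already evaluates float arithmetic in single precision it is redundant.

```csharp
public static class FloatPrecisionDemo
{
    // The sum is rounded to 32-bit precision before the comparison.
    public static bool EqualInSingle(float a, float b, float expected) =>
        (float)(a + b) == expected;

    // The sum is carried out and compared in 64-bit precision, mimicking an
    // implementation that evaluates the expression in a wider type.
    public static bool EqualInDouble(float a, float b, float expected) =>
        (double)a + (double)b == (double)expected;
}
```

For the values in the question, EqualInSingle(0.1f, 0.2f, 0.3f) is true, because the exact sum of the two floats rounds to the same 32-bit value as 0.3f. EqualInDouble(0.1f, 0.2f, 0.3f) is false, because the unrounded sum is a different number from 0.3f. Which of the two behaviours you get from a bare 0.1f+0.2f==0.3f is exactly what the spec leaves open.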

Math "pow" in Java and C# return slightly different results?

I am porting a program from C# to Java, and I've run into the following:
Java
Math.pow(0.392156862745098,1./3.) = 0.7319587495200227
C#
Math.Pow( 0.392156862745098, 1.0 / 3.0) =0.73195874952002271
this last digit leads to sufficient differences in further calculations. Is there any way to emulate c#'s pow?
Thanx
Just to confirm what Chris Shain wrote, I get the same binary values:
// Java
public class Test
{
    public static void main(String[] args)
    {
        double input = 0.392156862745098;
        double pow = Math.pow(input, 1.0/3.0);
        System.out.println(Double.doubleToLongBits(pow));
    }
}
// C#
using System;
public class Test
{
    static void Main()
    {
        double input = 0.392156862745098;
        double pow = Math.Pow(input, 1.0/3.0);
        Console.WriteLine(BitConverter.DoubleToInt64Bits(pow));
    }
}
Output of both: 4604768117848454313
In other words, the double values are exactly the same bit pattern, and any differences you're seeing (assuming you'd get the same results) are due to formatting rather than a difference in value. By the way, the exact value of that double is
0.73195874952002271118800535987247712910175323486328125
Now it's worth noting that distinctly weird things can happen in floating point arithmetic, particularly when optimizations allow 80-bit arithmetic in some situations but not others, etc.
As Henk says, if a difference in the last bit or two causes you problems, then your design is broken.
If your calculations are sensitive to this kind of difference then you will need other measures (a redesign).
this last digit leads to sufficient differences in further calculations
That's impossible, because they're the same number. A double doesn't have enough precision to distinguish between 0.7319587495200227 and 0.73195874952002271; they're both represented as
0.73195874952002271118800535987247712910175323486328125.
The difference is the rounding: Java is using 16 significant digits and C# is using 17. But that's just a display issue.
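A quick way to check this kind of thing yourself on the C# side, as a sketch: compare the bit patterns of the two literals and print with an explicit format. DoubleFormatting is a made-up helper name; BitConverter.DoubleToInt64Bits and the "G17" format specifier are standard.

```csharp
using System;
using System.Globalization;

public static class DoubleFormatting
{
    // True when both values are stored as the same 64-bit pattern.
    public static bool SameBits(double x, double y) =>
        BitConverter.DoubleToInt64Bits(x) == BitConverter.DoubleToInt64Bits(y);

    // Formats with 17 significant digits, independent of the current culture.
    public static string Digits17(double x) =>
        x.ToString("G17", CultureInfo.InvariantCulture);
}
```

DoubleFormatting.SameBits(0.7319587495200227, 0.73195874952002271) returns true, and Digits17 of either literal gives 0.73195874952002271. The two printed forms in the question are therefore the same double, shown with a different number of digits.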
Both Java and C# return an IEEE floating point number (specifically, a double) from Math.Pow. The difference that you are seeing is almost certainly due to the formatting when you display the number as decimal. The underlying (binary) value is probably the same, and your math troubles lie elsewhere.
Floating-point arithmetic is inherently imprecise. You are claiming that the C# answer is "better" but neither of them is that accurate. For example, Wolfram Alpha (which is much more accurate indeed) gives these values:
http://www.wolframalpha.com/input/?i=Pow%280.392156862745098%2C+1.0+%2F+3.0%29
If a unit's difference in the 17th digit is causing later computations to go awry, then I think there's a problem with your math, not with Java's implementation of pow. You need to think about how to restructure your computations so that they don't rely on such minor differences.
Seventeen significant decimal digits is the most that any IEEE double-precision number can meaningfully carry, regardless of language:
http://en.wikipedia.org/wiki/Double-precision_floating-point_format

Why is the division result between two integers truncated?

All experienced programmers in C# (I think this comes from C) are used to casting one of the integers in a division to get the decimal / double / float result instead of the int (the real result truncated).
I'd like to know why is this implemented like this? Is there ANY good reason to truncate the result if both numbers are integer?
C# traces its heritage to C, so the answer to "why is it like this in C#?" is a combination of "why is it like this in C?" and "was there no good reason to change?"
The approach of C is to have a fairly close correspondence between the high-level language and low-level operations. Processors generally implement integer division as returning a quotient and a remainder, both of which are of the same type as the operands.
(So my question would be, "why doesn't integer division in C-like languages return two integers", not "why doesn't it return a floating point value?")
The solution was to provide separate operations for division and remainder, each of which returns an integer. In the context of C, it's not surprising that the result of each of these operations is an integer. This is frequently more accurate than floating-point arithmetic. Consider the example from your comment of 7 / 3. This value cannot be represented by a finite binary number nor by a finite decimal number. In other words, on today's computers, we cannot accurately represent 7 / 3 unless we use integers! The most accurate representation of this fraction is "quotient 2, remainder 1".
So, was there no good reason to change? I can't think of any, and I can think of a few good reasons not to change. None of the other answers has mentioned Visual Basic which (at least through version 6) has two operators for dividing integers: / converts the integers to double, and returns a double, while \ performs normal integer arithmetic.
I learned about the \ operator after struggling to implement a binary search algorithm using floating-point division. It was really painful, and integer division came in like a breath of fresh air. Without it, there was lots of special handling to cover edge cases and off-by-one errors in the first draft of the procedure.
From that experience, I draw the conclusion that having different operators for dividing integers is confusing.
Another alternative would be to have only one integer operation, which always returns a double, and require programmers to truncate it. This means you have to perform two int->double conversions, a truncation and a double->int conversion every time you want integer division. And how many programmers would mistakenly round or floor the result instead of truncating it? It's a more complicated system, and at least as prone to programmer error, and slower.
Finally, in addition to binary search, there are many standard algorithms that employ integer arithmetic. One example is dividing collections of objects into sub-collections of similar size. Another is converting between indices in a 1-d array and coordinates in a 2-d matrix.
As far as I can see, no alternative to "int / int yields int" survives a cost-benefit analysis in terms of language usability, so there's no reason to change the behavior inherited from C.
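As a small illustration of the index conversion mentioned above (a sketch; GridIndex and its method names are invented, and a row-major layout is assumed), integer division and remainder do all the work without any floating-point detour:

```csharp
public static class GridIndex
{
    // Row-major layout assumed: index = row * width + col.
    public static (int Row, int Col) ToCoordinates(int index, int width) =>
        (index / width, index % width);

    public static int ToIndex(int row, int col, int width) =>
        row * width + col;
}
```

For a matrix that is 3 columns wide, GridIndex.ToCoordinates(7, 3) gives row 2, column 1, which is the "quotient 2, remainder 1" from the 7 / 3 example, and GridIndex.ToIndex(2, 1, 3) takes you back to 7.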
In conclusion:
Integer division is frequently useful in many standard algorithms.
When the floating-point division of integers is needed, it may be invoked explicitly with a simple, short, and clear cast: (double)a / b rather than a / b
Other alternatives introduce more complication for the programmer and more clock cycles for the processor.
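For reference, a short sketch of the three behaviours discussed in this answer. DivisionDemo is a hypothetical helper; the point is only to show that truncation and flooring disagree for negative operands, which is where the "mistakenly floor instead of truncate" bugs would come from.

```csharp
using System;

public static class DivisionDemo
{
    // int / int truncates toward zero.
    public static int Truncated(int a, int b) => a / b;

    // Casting one operand makes the division floating-point.
    public static double Exact(int a, int b) => (double)a / b;

    // Flooring rounds toward negative infinity, which differs from
    // truncation whenever the exact quotient is negative and fractional.
    public static int Floored(int a, int b) => (int)Math.Floor((double)a / b);
}
```

DivisionDemo.Truncated(-7, 2) is -3, DivisionDemo.Floored(-7, 2) is -4, and DivisionDemo.Exact(7, 2) is 3.5.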
Is there ANY good reason to truncate the result if both numbers are integer?
Of course; I can think of a dozen such scenarios easily. For example: you have a large image, and a thumbnail version of the image which is 10 times smaller in both dimensions. When the user clicks on a point in the large image, you wish to identify the corresponding pixel in the scaled-down image. Clearly to do so, you divide both the x and y coordinates by 10. Why would you want to get a result in decimal? The corresponding coordinates are going to be integer coordinates in the thumbnail bitmap.
Doubles are great for physics calculations and decimals are great for financial calculations, but almost all the work I do with computers that does any math at all does it entirely in integers. I don't want to be constantly having to convert doubles or decimals back to integers just because I did some division. If you are solving physics or financial problems then why are you using integers in the first place? Use nothing but doubles or decimals. Use integers to solve finite mathematics problems.
Calculating on integers is faster (usually) than on floating point values. Besides, all other integer/integer operations (+, -, *) return an integer.
EDIT:
As per the request of the OP, here's some addition:
The OP's problem is that they think of / as division in the mathematical sense, and the / operator in the language performs some other operation (which is not the math. division). By this logic they should question the validity of all other operations (+, -, *) as well, since those have special overflow rules, which is not the same as would be expected from their math counterparts. If this is bothersome for someone, they should find another language where the operations perform as expected by the person.
As for the claim on performance difference in favor of integer values: When I wrote the answer I only had "folk" knowledge and "intuition" to back up the claim (hence my "usually" disclaimer). Indeed as Gabe pointed out, there are platforms where this does not hold. On the other hand I found this link (point 12) that shows mixed performances on an Intel platform (the language used is Java, though).
The takeaway should be that with performance many claims and intuition are unsubstantiated until measured and found true.
Yes, if the end result needs to be a whole number. It would depend on the requirements.
If these are indeed your requirements, then you would not want to store a decimal and then truncate it. You would be wasting memory and processing time to accomplish something that is already built-in functionality.
The operator is designed to return the same type as its input.
Edit (comment response):
Why? I don't design languages, but I would assume most of the time you will be sticking with the data types you started with and in the remaining instance, what criteria would you use to automatically assume which type the user wants? Would you automatically expect a string when you need it? (sincerity intended)
If you add an int to an int, you expect to get an int. If you subtract an int from an int, you expect to get an int. If you multiply an int by an int, you expect to get an int. So why would you not expect an int result if you divide an int by an int? And if you expect an int, then you will have to truncate.
If you don't want that, then you need to cast your ints to something else first.
Edit: I'd also note that if you really want to understand why this is, then you should start looking into how binary math works and how it is implemented in an electronic circuit. It's certainly not necessary to understand it in detail, but having a quick overview of it would really help you understand how the low-level details of the hardware filter through to the details of high-level languages.

What is the recommended data type for scientific calculation in .Net?

What is the most recommended data type to use in scientific calculation in .Net? Is it float, double or something else?
Scientific values tend to be "natural" values (length, mass, time etc) where there's a natural degree of imprecision to start with - but where you may well want very, very large or very, very small numbers. For these values, double is generally a good idea. It's fast (with hardware support almost everywhere), scales up and down to huge/tiny values, and generally works fine if you're not concerned with exact decimal values.
decimal is a good type for "artificial" numbers where there's an exact value, almost always represented naturally as a decimal - the canonical example for this is currency. However, it's twice as expensive as double in terms of storage (16 bytes per value instead of 8), has a smaller range (due to a more limited exponent range) and is significantly slower due to a lack of hardware support.
I'd personally only use float if storage was an issue - it's amazing how quickly the inaccuracies can build up when you only have around 7 significant decimal places.
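To get a feel for how quickly float inaccuracies build up, here is a rough sketch that adds 0.1 repeatedly in both types. AccumulationDemo is an invented name, and the explicit (float) cast is only there to make sure each partial sum really is rounded to 32 bits on every runtime.

```csharp
public static class AccumulationDemo
{
    // Repeatedly adds 0.1f, rounding every partial sum to float.
    public static float SumAsFloat(int count)
    {
        float total = 0f;
        for (int i = 0; i < count; i++)
        {
            total = (float)(total + 0.1f);
        }
        return total;
    }

    // The same loop carried out in double precision.
    public static double SumAsDouble(int count)
    {
        double total = 0.0;
        for (int i = 0; i < count; i++)
        {
            total += 0.1;
        }
        return total;
    }
}
```

With a count of one million the mathematically exact answer is 100000. The double version stays within a tiny fraction of that, while the float version drifts by a clearly visible amount.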
Ultimately, as the comment from "bears will eat you" suggests, it depends on what values you're talking about - and of course what you plan to do with them. Without any further information I suspect that double is a good starting point - but you should really make the decision based on the individual situation.
Well, of course the term “scientific calculation” is a bit vague, but in general, it’s double.
float is largely for compatibility with libraries that expect 32-bit floating-point numbers. The performance of float and double operations (like addition) is exactly the same, so new code should always use double because it has greater precision.
However, the x86 JITter will never inline functions that take or return a float, so using float in methods could actually be slower. Once again, this is for compatibility: if it were inlined, the execution engine would skip a conversion step that reduces its precision, and thus the JITter could inadvertently change the result of some calculations if it were to inline such functions.
Finally, there’s also decimal. Use this whenever it is important to have a certain number of decimal places. The stereotypical use-case is currency operations, but of course it supports more than 2 decimal places; it’s actually a 128-bit piece of data (a 96-bit integer plus a sign and a scale).
If even the accuracy of 64-bit double is not enough, consider using an external library for arbitrary-precision numbers, but of course you will only need that if your specific scientific use-case specifically calls for it.
Double seems to be the most reliable data type for such operations. Even WPF uses it extensively.
Be aware that decimals are much more expensive to use than floats/doubles (in addition to what Jon Skeet and Timwi wrote).
I'd recommend double unless you need the value to be exact; decimal is for financial calculations that need this exactitude. Scientific calculations tolerate small errors because you can't exactly measure 1 meter anyways. Float only helps if storage is a problem (ie. huge matrices).
