I was having an issue with the CUDNN_BN_MIN_EPSILON value used by the cudnnBatchNormalizationForwardTraining function (see the docs here). It turned out I was passing the float value 1e-5f instead of a double (I'm working with float values to save memory and speed up computation), and 1e-5, once rounded to float, is slightly less than the double 1e-5, which is the actual value of that constant.
After some trial and error, I found a decent approximation I'm now using:
const float CUDNN_BN_MIN_EPSILON = 1e-5f + 5e-13f;
I'm sure there's a better way to approach problems like this, so the question is:
Given a positive double value, what is the best (as in "reliable") way to find the minimum possible float value which (on its own and if/when converted to double) is strictly greater than the initial double value?
Another way to formulate the problem: given a double value d1 and a float value f1, the difference d1 - f1 should be negative and as close to zero as possible (a positive difference would mean that f1 is less than d1, which is not what we're looking for).
I did some basic trial and error (using 1e-5 as my target value):
// Check the initial difference
> 1e-5 - 1e-5f
2.5262124918247909E-13 // We'd like a small negative value here
// Try to add the difference to the float value
> 1e-5 - (1e-5f + (float)(1e-5 - 1e-5f))
2.5262124918247909E-13 // Same, probably due to approximation
// Double the difference (as a test)
> 1e-5 - (1e-5f + (float)((1e-5 - 1e-5f) * 2))
-6.5687345259044915E-13 // OK
With this approximation, the final float value is 1.00000007E-05, which looks fine.
But, that * 2 multiplication was completely arbitrary on my end, and I'm not sure it'll be reliable or the optimum possible thing to do there.
Is there a better way to achieve this?
Thanks!
EDIT: this is the (bad) solution I'm using now, will be happy to replace it with a better one!
/// <summary>
/// Returns the minimum possible upper <see cref="float"/> approximation of the given <see cref="double"/> value
/// </summary>
/// <param name="value">The value to approximate</param>
public static float ToApproximatedFloat(this double value)
=> (float)value + (float)((value - (float)value) * 2);
SOLUTION: this is the final, correct implementation (thanks to John Bollinger):
public static unsafe float ToApproximatedFloat(this double value)
{
    // Obtain the bit representation of the double value
    ulong bits = *((ulong*)&value);
    // Extract and re-bias the exponent field
    ulong exponent = ((bits >> 52) & 0x7FF) - 1023 + 127;
    // Extract the significand bits and truncate the excess
    ulong significand = (bits >> 29) & 0x7FFFFF;
    // Assemble the result in 32-bit unsigned integer format, then add 1
    ulong converted = (((bits >> 32) & 0x80000000u)
                       | (exponent << 23)
                       | significand) + 1;
    // Reinterpret the bit pattern as a float
    return *((float*)&converted);
}
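If you can target a recent runtime, the same thing can be done without unsafe code. Below is a minimal sketch assuming .NET Core 3.0 or later (where MathF.BitIncrement is available); the method name here is just for illustration:

public static float ToApproximatedFloatSafe(double value)
{
    // (float)value rounds to the nearest float; if that rounding went down
    // (or landed exactly on value), step up by one ulp to get a float that
    // is strictly greater. Infinities and NaN are not handled specially.
    float f = (float)value;
    return f > value ? f : MathF.BitIncrement(f);
}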
In C:
#include <math.h>
float NextFloatGreaterThan(double x)
{
    float y = x;
    if (y <= x) y = nexttowardf(y, INFINITY);
    return y;
}
If you do not want to use library routines, then replace nexttowardf(y, INFINITY) above with -NextBefore(-y), where NextBefore is taken from this answer and modified:
Change double to float and DBL_ to FLT_.
Change .625 to .625f.
Replace fmax(SmallestPositive, fabs(q)*Scale) with SmallestPositive < fabs(q)*Scale ? fabs(q)*Scale : SmallestPositive.
Replace fabs(q) with (q < 0 ? -q : q).
(Obviously, the routine could be converted from -NextBefore(-y) to NextAfter(y). That is left as an exercise for the reader.)
Inasmuch as you seem interested in the representation-level details, you'll be dependent on the representations of types float and double. In practice, however, it is very likely that that comes down to the basic "binary32" and "binary64" formats of IEEE-754. These have the general form of one sign bit, several bits of biased exponent, and a bunch of bits of significand, including, for normalized values, one implicit bit of significand.
Simple case
Given a double in IEEE-754 binary64 format whose value is no less than +2^-126, what you want to do is
obtain the bit pattern of the original double value in a form that can be directly examined and manipulated. For example, as an unsigned 64-bit integer.
double d = 1e-5;
uint64_t bits;
memcpy(&bits, &d, 8);
extract and re-bias the exponent field
uint64_t exponent = ((bits >> 52) & 0x7FF) - 1023 + 127;
extract the significand bits and truncate the excess
uint64_t significand = (bits >> 29) & 0x7fffff;
assemble the result in 32-bit unsigned integer format
uint32_t float_bits = ((bits >> 32) & 0x80000000u)
                      | (exponent << 23)
                      | significand;
add one. Since you want a result strictly greater than the original double, this is correct regardless of whether all of the truncated significand bits were 0. It will correctly increment the exponent field if the addition overflows the significand bits. It may, however, produce the bit pattern of an infinity.
float_bits += 1;
store / copy / reinterpret the bit pattern as that of a float
float f;
memcpy(&f, &float_bits, 4);
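Putting those steps together (a sketch only, valid under the simple-case assumptions above: IEEE-754 formats, a positive input no smaller than 2^-126, and a result that stays within the finite float range):

#include <stdint.h>
#include <string.h>

float next_float_above(double d)
{
    /* Bit pattern of the double */
    uint64_t bits;
    memcpy(&bits, &d, 8);

    /* Re-bias the exponent and truncate the significand to 23 bits */
    uint64_t exponent = ((bits >> 52) & 0x7FF) - 1023 + 127;
    uint64_t significand = (bits >> 29) & 0x7FFFFF;

    /* Assemble the binary32 pattern and add one to get a strictly greater value */
    uint32_t float_bits = (uint32_t)(((bits >> 32) & 0x80000000u)
                                     | (exponent << 23)
                                     | significand) + 1;

    /* Reinterpret the bit pattern as a float */
    float f;
    memcpy(&f, &float_bits, 4);
    return f;
}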
Negative numbers
Given a negative double in binary64 format whose magnitude is no less than 2^-126, follow the above procedure except subtract 1 from float_bits instead of adding one. Note that for exactly -2^-126, this produces a subnormal binary32 (see below), which is the correct result.
Zeroes and very small numbers, including subnormals
IEEE 754 provides reduced-precision representations of non-zero numbers of very small magnitude. Such representations are called subnormal. Under some circumstances the minimum binary32 exceeding a given input binary64 is a subnormal, including for some inputs that are not binary64 subnormals.
Also, IEEE 754 provides signed zeroes, and -0 is a special case: the minimum binary32 strictly greater than -0 (either format) is the smallest positive subnormal number. Note: not +0, because according to IEEE 754, +0 and -0 compare equal via the normal comparison operators. The minimum positive, nonzero, subnormal binary32 value has bit pattern 0x00000001.
The binary64 values subject to these considerations have biased binary64 exponent fields with values less than or equal to the difference between the binary64 exponent bias and the binary32 exponent bias (896). This includes those with biased exponents of exactly 0, which characterize binary64 zeroes and subnormals. Examination of the rebiasing step in the simple-case procedure should lead you to conclude, correctly, that that procedure will produce the wrong result for such inputs.
Code for these cases is left as an exercise.
Infinities and NaNs
Inputs with all bits of the biased binary64 exponent field set represent either positive or negative infinity (when the binary64 significand has no bits set) or a not-a-number (NaN) value. Binary64 NaNs and positive infinity should convert to their binary32 equivalents. Negative infinity should perhaps convert to the negative binary32 value of greatest magnitude. These need to be handled as special cases.
Code for these cases is left as an exercise.
Related
The documentation of Random.NextDouble():
Returns a random floating-point number that is greater than or equal to 0.0, and less than 1.0.
So, it can be exactly 0. But what are the chances for that?
var random = new Random();
var x = random.NextDouble();
if(x == 0){
// probability for this?
}
It would be easy to calculate the probability for Random.Next() being 0, but I have no idea how to do it in this case...
As mentioned in comments, it depends on the internal implementation of NextDouble. In the "old" .NET Framework, and in modern .NET up to version 5, it looks like this:
protected virtual double Sample() {
return (InternalSample()*(1.0/MBIG));
}
InternalSample returns an integer in the 0 to Int32.MaxValue range, 0 included, Int32.MaxValue excluded. We can assume that the distribution of InternalSample is uniform (the docs for the Next method, which just calls InternalSample, hint that it is, and there seems to be no reason to use a non-uniform distribution in a general-purpose RNG for integers). That means every number is equally likely. We then have 2,147,483,647 numbers in the distribution, and the probability of drawing 0 is 1 / 2,147,483,647.
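In other words (a tiny illustrative calculation, not taken from the Random source), the only way for Sample() to return 0.0 is for InternalSample() to return 0:

// Assuming InternalSample() is uniform over [0, int.MaxValue)
double probabilityOfZero = 1.0 / int.MaxValue;   // 1 / 2,147,483,647
Console.WriteLine(probabilityOfZero);            // ≈ 4.66E-10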
In modern .NET 6+ there are two implementations. The first is used when you provide an explicit seed value to the Random constructor. This implementation is the same as above and is kept for compatibility reasons, so that old code relying on the seed value to produce deterministic results does not break when moving to the new .NET version.
The second implementation is new and is used when you do NOT pass a seed into the Random constructor. Source code:
public override double NextDouble() =>
// As described in http://prng.di.unimi.it/:
// "A standard double (64-bit) floating-point number in IEEE floating point format has 52 bits of significand,
// plus an implicit bit at the left of the significand. Thus, the representation can actually store numbers with
// 53 significant binary digits. Because of this fact, in C99 a 64-bit unsigned integer x should be converted to
// a 64-bit double using the expression
// (x >> 11) * 0x1.0p-53"
(NextUInt64() >> 11) * (1.0 / (1ul << 53));
We first obtain a random 64-bit unsigned integer. Now, we could multiply it by 1 / 2^64 to obtain a double in the 0..1 range, but that would make the resulting distribution biased. A double is represented by a 53-bit mantissa (52 bits are explicit and one is implicit), an exponent and a sign; the mantissa gives us 53 significant bits with which to represent integer values exactly. But we have a 64-bit integer here. This means integer values less than 2^53 can be represented exactly by a double, while bigger integers cannot. For example:
ulong l1 = 1ul << 53;
ulong l2 = l1 + 1;
double d1 = l1;
double d2 = l2;
Console.WriteLine(d1 == d2);
Prints "true", so two different integers map to the same double value. That means if we just multiply our 64-bit integer by 1 / 2^64 - we'll get a biased non-uniform distribution, because many integers bigger than 2^53-1 will map to the same values.
So instead, we throw away 11 bits and multiply the result by 1 / 2^53 to get a uniform distribution in the 0..1 range. The probability of getting 0 is then 1 / 2^53 (1 / 9,007,199,254,740,992). This implementation is better than the old one because it provides many more distinct doubles in the 0..1 range (2^53, compared to roughly 2^31 in the old one).
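A quick way to see this (an illustrative snippet, not from the Random source): the expression yields 0 exactly when the 53 bits that survive the shift are all zero, which happens for the 2^11 inputs smaller than 2^11, giving a probability of 2^11 / 2^64 = 1 / 2^53:

ulong sample = 0x7FF;                                  // any value below 2^11
double d = (sample >> 11) * (1.0 / (1ul << 53));
Console.WriteLine(d);                                  // 0
Console.WriteLine(Math.Pow(2, 11) / Math.Pow(2, 64));  // 1.1102230246251565E-16, i.e. 1 / 2^53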
You also asked in comments:
If one knows how many numbers there are between 0 inclusive and 1 exclusive (according to IEEE 754), it would be possible to answer the 'probability' question, because 0 is one of all of them
That's not so. There are actually more than 2^53 representable numbers between 0 and 1 in IEEE 754. We have 52 bits of mantissa and 11 bits of exponent, roughly half of which are for negative exponents. Almost every negative exponent (roughly half of that 11-bit range), combined with the mantissa, gives us a distinct value in the 0..1 range.
Why can't we use the full set of doubles in the 0..1 range that IEEE 754 allows when generating a random number? Because those values are not uniformly spaced (just as the full double range is not uniformly spaced). For example, there are more representable numbers in the 0..0.5 range than in the 0.5..1 range.
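You can count them directly, because for positive finite doubles consecutive values have consecutive bit patterns when reinterpreted as integers (an illustrative snippet, not from the original answer):

// Representable doubles in [0.5, 1.0) versus in (0, 0.5)
long inUpperHalf = BitConverter.DoubleToInt64Bits(1.0) - BitConverter.DoubleToInt64Bits(0.5);
long inLowerHalf = BitConverter.DoubleToInt64Bits(0.5) - BitConverter.DoubleToInt64Bits(double.Epsilon);
Console.WriteLine(inUpperHalf);                // 4503599627370496 (2^52)
Console.WriteLine(inLowerHalf > inUpperHalf);  // True: far more doubles below 0.5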
This is from a strictly academic perspective.
From Double Struct:
All floating-point numbers also have a limited number of significant digits, which also determines how accurately a floating-point value approximates a real number. A Double value has up to 15 decimal digits of precision, although a maximum of 17 digits is maintained internally. This means that some floating-point operations may lack the precision to change a floating point value.
If only 15 decimal digits are significant, then your possible return values are:
0.000000000000000
To:
0.999999999999999
Said differently, you have 10^15 possible (comparably different, "distinct") values (see Permutations in the first answer):
10^15 = 1,000,000,000,000,000
Zero is just ONE of those possibilities:
1 / 1,000,000,000,000,000 = 0.000000000000001
Stated as a percentage:
0.0000000000001% chance of zero being randomly selected?
I think this is the closest "correct" answer you're going to get...
...whether it performs this way in practice is possibly a different story.
Just create a simple program, and let it run until you are satisfied with the number of tries done. (See: https://onlinegdb.com/ij1M50gRQ)
Random r = new Random();
Double d;
int attempts = 0;
int attempts0 = 0;
while (true) {
    d = Math.Round(r.NextDouble(), 3);
    if (d == 0) attempts0++;
    attempts++;
    if (attempts % 1000000 == 0) Console.WriteLine($"Attempts: {attempts}, with {attempts0} times a 0 value, this is {Math.Round(100.0 * attempts0 / attempts, 3)} %");
}
example output:
...
Attempts: 208000000, with 103831 times a 0 value, this is 0.05 %
Attempts: 209000000, with 104315 times a 0 value, this is 0.05 %
Attempts: 210000000, with 104787 times a 0 value, this is 0.05 %
Attempts: 211000000, with 105305 times a 0 value, this is 0.05 %
Attempts: 212000000, with 105853 times a 0 value, this is 0.05 %
Attempts: 213000000, with 106349 times a 0 value, this is 0.05 %
Attempts: 214000000, with 106839 times a 0 value, this is 0.05 %
...
Changing d to be rounded to 2 decimals instead returns about 0.5%.
Excerpt from a book:
A float value consists of a 24-bit signed mantissa and an 8-bit signed exponent. The precision is approximately seven decimal digits. Values range from -3.402823 × 10^38 to 3.402823 × 10^38
How to calculate this range? Can someone explain the binary arithmetic?
You need to read "What Every Computer Scientist Should Know About Floating-Point Arithmetic" which will explain how floating point numbers are stored, which will also answer your question.
I would definitely read the article to which Richard points. But if you need a simpler explanation, I hope this helps:
Basically, as you said, there is 1 sign bit, 8 bits for exponent, and 23 for fraction.
Then, using this equation (from Wikipedia)
N = (1 - 2s) * 2^(x-127) * (1 + m*2^-23)
where s is the sign bit, x is the exponent (minus the 127 bias), and m is the fractional part treated as a whole number (the equation above transforms the whole number into the appropriate fraction value).
Note that the exponent value 0xFF is reserved to represent infinity (and NaN), so the largest exponent of a finite value is 0xFE.
you see that the maximum value is
N = (1 - 2*0) * 2^(254-127) * (1 + (2^23 - 1) * 2^-23)
N = 1 * 2^127 * 1.999999
N = 3.4 x 10^38
The minimum value would be the same but with the sign bit set, which would simply negate the value to give you -3.4 x 10^38.
Q.E.D.
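If you want to double-check that result, you can assemble the bit pattern of the largest finite float yourself (an illustrative sketch; it assumes a runtime where BitConverter.Int32BitsToSingle is available, e.g. .NET Core):

// sign 0, biased exponent 0xFE (254), significand all ones
uint bits = (0u << 31) | (0xFEu << 23) | 0x7FFFFFu;
float largest = BitConverter.Int32BitsToSingle((int)bits);
Console.WriteLine(largest);                   // ≈ 3.4028235E+38
Console.WriteLine(largest == float.MaxValue); // True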
A colleague has written some code along these lines:
var roundedNumber = (float) Math.Round(someFloat, 2);
Console.WriteLine(roundedNumber);
I have an uncertainty about this code - is the number that gets written here even guaranteed to have 2 decimal places any more? It seems plausible to me that truncation of the double Math.Round(someFloat, 2) to float might result in a number whose string representation has more than 2 digits. Can anybody either provide an example of this (demonstrating that such a cast is unsafe) or else demonstrate somehow that it is safe to perform such a cast?
Assuming single and double precision IEEE754 representation and rules, I have checked for the first 2^24 integers i that
float(double( i/100 )) = float(i/100)
in other words, converting a decimal value with 2 decimal places twice (first to the nearest double, then to the nearest single precision float) is the same as converting the decimal directly to single precision, as long as the integer part of the decimal is not too large.
I have no guarantee for larger values.
The double approximation and the single approximation are different, but that's not really the question.
Converting twice is innocuous up to at least 167772.16; it's the same as if Math.Round had produced the result directly in single precision.
Here is the testing code in Squeak/Pharo Smalltalk with the ArbitraryPrecisionFloat package (sorry for not showing it in C#, but the language does not really matter, only the IEEE rules do).
(1 to: 1<<24)
detect: [:i |
(i/100.0 asArbitraryPrecisionFloatNumBits: 24) ~= (i/100 asArbitraryPrecisionFloatNumBits: 24) ]
ifNone: [nil].
EDIT
The test above was superfluous because, thanks to the excellent reference provided by Mark Dickinson (Innocuous double rounding of basic arithmetic operations), we know that doing float(double(x) / double(y)) produces a correctly-rounded value for x / y, as long as x and y are both representable as floats, which is the case for any 0 <= x <= 2^24 and for y = 100.
EDIT
I have checked with numerators up to 2^30 (decimal value > 10 millions), and converting twice is still identical to converting once. Going further with an interpreted language is not good wrt global warming...
Possible Duplicate:
Why is floating point arithmetic in C# imprecise?
Why is there a bias in floating point ops? Any specific reason?
Output:
160
139
static void Main()
{
    float x = (float) 1.6;
    int y = (int)(x * 100);
    float a = (float) 1.4;
    int b = (int)(a * 100);
    Console.WriteLine(y);
    Console.WriteLine(b);
    Console.ReadKey();
}
Any rational number whose denominator (in lowest terms) is not a power of 2 leads to an infinite number of digits when represented in binary. Here you have 8/5 and 7/5, so there is no exact binary floating-point representation (unless you have infinite memory).
The exact binary representation of 1.6 is 1.100110011001100110011001100110011...
The exact binary representation of 1.4 is 1.011001100110011001100110011001100...
Both values have an infinite number of binary digits (the same four-bit block repeats endlessly).
float values have a precision of 24 bits. So the binary representation of any value will be rounded to 24 bits. If you round the given values to 24 bits you get:
1.6: 110011001100110011001101 (decimal 13421773) - rounded up
1.4: 101100110011001100110011 (decimal 11744051) - rounded down
Both values have an exponent of 0 (the first bit is 2^0 = 1, the second is 2^-1 = 0.5 etc.).
Since the first bit in a 24 bit value is 2^23 you can calculate the exact decimal values by dividing the 24 bit values (13421773 and 11744051) by two 23 times.
The values are: 1.60000002384185791015625 and 1.39999997615814208984375.
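You can verify those two 24-bit significands directly from the stored bit patterns (an illustrative snippet; BitConverter.SingleToInt32Bits requires .NET Core 2.0 or later, otherwise BitConverter.ToInt32(BitConverter.GetBytes(f), 0) does the same job):

int bits16 = BitConverter.SingleToInt32Bits(1.6f);
int bits14 = BitConverter.SingleToInt32Bits(1.4f);

// 23 stored fraction bits plus the implicit leading 1 (bit 23)
Console.WriteLine((bits16 & 0x7FFFFF) | 0x800000);   // 13421773
Console.WriteLine((bits14 & 0x7FFFFF) | 0x800000);   // 11744051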
When using floating-point types you always have to consider that their precision is finite. Values that can be written exactly as decimals may be rounded up or down when represented in binary. Casting to int does not account for that, because it simply truncates the value; you should use something like Math.Round instead.
If you really need an exact representation of rational numbers you need a completely different approach. Since rational numbers are fractions you can use integers to represent them. Here is an example of how you can achieve that.
However, you cannot then write Rational x = (Rational)1.6. You have to write something like Rational x = new Rational(8, 5) (or new Rational(16, 10), etc.).
This is due to the fact that floating-point arithmetic is not exact. When you set a to 1.4, internally it may not be exactly 1.4, just as close as can be represented with machine precision. If it is fractionally less than 1.4, then multiplying by 100 and casting to integer keeps only the integer portion, which in this case is 139. You will find far more technically precise answers, but essentially this is what is happening.
In the case of your output for 1.6, the floating-point representation may actually be minutely larger than 1.6, so when you multiply by 100 the total is slightly larger than 160 and the integer cast gives you what you expect. The fact is that there is simply not enough precision available in a computer to store every real number exactly.
See this link for details of the conversion from floating point to integer types http://msdn.microsoft.com/en-us/library/aa691289%28v=vs.71%29.aspx - it has its own section.
The floating-point types float (32 bit) and double (64 bit) have limited precision and, moreover, the value is represented in binary internally. Just as you cannot represent 1/7 precisely in a decimal system (~ 0.1428571428571428...), 1/10 cannot be represented precisely in a binary system.
You can however use the decimal type. It still has limited (though high) precision, but the numbers are represented in decimal form internally. Therefore a value like 1/10 is represented exactly as 0.1000000000000000000000000000 internally. 1/7 is still a problem for decimal. But at least you don't lose precision by converting to binary and then back to decimal.
Consider using decimal.
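For instance (a small illustrative comparison, assuming the usual IEEE-754 float and System.Decimal behaviour):

float f = 1.4f;
decimal m = 1.4m;

Console.WriteLine(f == 1.4);        // False: 1.4f is actually 1.39999997615814208984375
Console.WriteLine(m * 100);         // 140.0: decimal stores 1.4 exactly
Console.WriteLine((int)(m * 100));  // 140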