I have a problem where I want to pack very specific data into specific bits and store it in a float. I have already investigated ways of simply packing numbers into floats, but simple math won't work for what I want because I need to use the full 32 bits to store very specific values. If it were a 32-bit integer, this would be very easy.
So what I want to do is encode the data as a 32-bit integer and then turn that bit pattern directly into a float, with all the bits remaining the same (unless someone has a better suggestion for how to do it). What languages will allow me to do a conversion like this? Obviously not JavaScript or Python, because they don't support 32-bit floats. Will C# or C++ do it?
I need to decode the data in a GLSL or HLSL vertex shader. The shader would, of course, receive the 32-bit float. Is there an operator that will let me turn the float directly into an integer with all the same bits, instead of an ordinary cast? Or perhaps some other way to read the bits directly?
UPDATE: Eric Postpischil showed how to easily do the direct conversion in C in an answer below. Now I just need to know if there's a way to do a direct conversion from float to int or bit data in a vertex shader. Can anyone help on that part?
You can do this in C with:
#include <stdint.h>

/* Reinterpret the bits of a 32-bit unsigned integer as a float. */
float IntegerToFloat(uint32_t u)
{
    return (union { uint32_t u; float f; }) {u} .f;
}

/* Reinterpret the bits of a float as a 32-bit unsigned integer. */
uint32_t FloatToInteger(float f)
{
    return (union { float f; uint32_t u; }) {f} .u;
}
Naturally, this requires that float be 32 bits in the C implementation, and that uint32_t be a supported type (but you can use another 32-bit integer type if it is not, likely unsigned int). Some of the resulting float values may be NaNs, which might not remain unchanged in certain operations, such as conversion for printing or display and conversion back. Even normal float values will not generally remain unchanged unless they are displayed with sufficient precision and the C implementation uses correct rounding for decimal-to-binary and binary-to-decimal conversions.
So abusing the bits like this is a bad idea unless there is truly no alternative.
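For the C# part of the question, a minimal sketch using BitConverter (round-tripping through a byte array, which should work on any .NET version) could look like the following; the same NaN caveat applies:

using System;

class FloatBits
{
    // Reinterpret the bits of a 32-bit integer as a float (no numeric conversion).
    static float IntegerToFloat(uint u)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(u), 0);
    }

    // Reinterpret the bits of a float as a 32-bit integer.
    static uint FloatToInteger(float f)
    {
        return BitConverter.ToUInt32(BitConverter.GetBytes(f), 0);
    }

    static void Main()
    {
        uint packed = 0x3F800000;                    // bit pattern of 1.0f
        float asFloat = IntegerToFloat(packed);
        Console.WriteLine(asFloat);                  // 1
        Console.WriteLine(FloatToInteger(asFloat));  // 1065353216 (0x3F800000)
    }
}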
Related
I want to change a value of, let's say, type int to type short, with the value itself "normalized" to the maximum value short can store - that is, int.MaxValue would convert to short.MaxValue, and vice versa.
Here's an example using floating-point math to demonstrate:
public static short Rescale(int value)
{
    float normalized = (float)value / int.MaxValue;  // normalize the value to -1.0 to 1.0
    float rescaled = normalized * (float)short.MaxValue;
    return (short)rescaled;
}
While this works, it seems like using floating-point math is really inefficient and could be improved, since we're dealing with binary data here. I tried using bit-shifting, but to no avail.
Both signed and unsigned values are going to be processed - that isn't really an issue with the floating-point solution, but it makes bit-shifting and other bit manipulation much more difficult.
This code will be used in quite a performance heavy context - it will be called 512 times every ~20 milliseconds, so performance is pretty important here.
How can I do this with bit-manipulation (or plain old integer algebra, if bit manipulation isn't necessary) and avoid floating-point math when we're operating on integer values?
You should use the shift operator. It is very fast.
int is 32 bits, short is 16, so shift right by 16 bits to scale your int down to a short:
int x = 208908324;

// 32 bits vs 16 bits.
short k = (short)(x >> 16);
Just reverse the process for scaling up. Obviously the lower bits will be filled with zeros.
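Putting both directions together, a minimal sketch (reusing the Rescale name from the question) might be:

public static class Rescaler
{
    // Sketch only: scales by dropping the low 16 bits rather than dividing by
    // int.MaxValue, so results can differ slightly from the floating-point version.
    public static short Rescale(int value)
    {
        return (short)(value >> 16);   // keep the top 16 bits; the sign is preserved
    }

    // Reverse direction: the low 16 bits are filled with zeros.
    public static int RescaleUp(short value)
    {
        return value << 16;
    }
}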
When I initialize a ulong with the value 18446744073709551615, add 1 to it, and display it to the console, it displays 0, which is totally expected.
I know this question sounds stupid, but I have to ask it: if my computer has a 64-bit CPU, how is my calculator able to work with numbers larger than 18446744073709551615?
I suppose floating point has a lot to do with it.
I would like to know exactly how this happens.
Thank you.
"working with larger numbers than 18446744073709551615"
"if my Computer has a 64-bit architecture CPU" --> The architecture bit size is largely irrelevant.
Consider how you are able to add 2 decimal digits whose sum is more than 9. There is a carry generated and then used when adding the next most significant decimal place.
The CPU can do the same but with base 18446744073709551616 instead of base 10. It uses a carry bit as well as a sign and overflow bit to perform extended math.
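As a rough sketch (illustrative C#, not how any real CPU or calculator is implemented), a large number can be stored as an array of 64-bit "digits" and added with a carry, just like the decimal example:

using System;

static class BigAdd
{
    // Add two non-negative numbers stored as little-endian arrays of base-2^64 digits.
    static ulong[] AddLimbs(ulong[] a, ulong[] b)
    {
        int n = Math.Max(a.Length, b.Length);
        ulong[] result = new ulong[n + 1];       // one extra digit for a final carry
        ulong carry = 0;
        for (int i = 0; i < n; i++)
        {
            ulong x = i < a.Length ? a[i] : 0;
            ulong y = i < b.Length ? b[i] : 0;
            ulong sum = unchecked(x + y);        // may wrap, just like ulong.MaxValue + 1
            ulong carryOut = sum < x ? 1UL : 0UL;
            sum = unchecked(sum + carry);
            if (sum < carry) carryOut = 1;       // adding the carry can also wrap
            result[i] = sum;
            carry = carryOut;
        }
        result[n] = carry;
        return result;
    }

    static void Main()
    {
        // 18446744073709551615 + 1 = 18446744073709551616, i.e. digits [0, 1] in base 2^64.
        ulong[] sum = AddLimbs(new[] { ulong.MaxValue }, new ulong[] { 1 });
        Console.WriteLine(sum[0]); // 0
        Console.WriteLine(sum[1]); // 1
    }
}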
"I suppose floating-point has a lot to do here."
This has nothing to do with floating point; you say you're using ulong, which means you're using unsigned 64-bit arithmetic. The largest value you can store is therefore "all ones" for 64 bits - aka UInt64.MaxValue, which, as you've discovered, is 18446744073709551615: https://learn.microsoft.com/en-us/dotnet/api/system.uint64.maxvalue
If you want to store arbitrarily large numbers, there are APIs for that - for example, BigInteger. However, arbitrary size comes at a cost, so it isn't the default, and it certainly isn't what you get when you use ulong (or double, or decimal, etc. - all the compiler-level numeric types have a fixed size).
So: consider using BigInteger
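A minimal sketch (assuming a reference to System.Numerics):

using System;
using System.Numerics;

class BigExample
{
    static void Main()
    {
        BigInteger big = ulong.MaxValue;  // 18446744073709551615, via the implicit conversion
        big += 1;
        Console.WriteLine(big);           // 18446744073709551616 - no wrap-around
    }
}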
Either way, you have a 64-bit processor that is limited to 64-bit math - your problem is easiest to explain with an explicit example of how it is solved by BigInteger in the System.Numerics namespace, available in .NET Framework 4.8 for example. The basic idea is to 'decompose' the number into an array representation.
The mathematical term 'decompose' here means:
"express (a number or function) as a combination of simpler components."
Internally, BigInteger uses an internal array (actually multiple internal constructs) and a helper class called BigIntegerBuilder. It can implicitly convert a UInt64 integer without a problem; for even bigger numbers you can use the + operator, for example.
BigInteger bignum = new BigInteger(18446744073709551615);
bignum += 1;
You can read about the implicit operator here:
https://referencesource.microsoft.com/#System.Numerics/System/Numerics/BigInteger.cs
public static BigInteger operator +(BigInteger left, BigInteger right)
{
    left.AssertValid();
    right.AssertValid();

    if (right.IsZero) return left;
    if (left.IsZero) return right;

    int sign1 = +1;
    int sign2 = +1;

    BigIntegerBuilder reg1 = new BigIntegerBuilder(left, ref sign1);
    BigIntegerBuilder reg2 = new BigIntegerBuilder(right, ref sign2);

    if (sign1 == sign2)
        reg1.Add(ref reg2);
    else
        reg1.Sub(ref sign1, ref reg2);

    return reg1.GetInteger(sign1);
}
In the code above from ReferenceSource you can see that we use the BigIntegerBuilder to add the left and right parts, which are also BigInteger constructs.
Interestingly, it keeps its internal state in a private array called "_bits", and that is the answer to your question: BigInteger keeps track of an array of 32-bit integers and is therefore able to handle big integers, even beyond 64 bits.
You can drop this code into a console application or LINQPad (which has the .Dump() method I use here) and inspect it:
BigInteger bignum = new BigInteger(18446744073709551615);
bignum.GetType().GetField("_bits",
BindingFlags.NonPublic | BindingFlags.Instance).GetValue(bignum).Dump();
A detail about BigInteger is revealed in a comment in its source code on Reference Source (quoted below): for small values, BigInteger stores the value directly in the _sign field; for all other values the _bits field is used.
Obviously, the internal array needs to be convertible into a representation in the decimal system (base 10) so humans can read it; the ToString() method converts the BigInteger to a string representation.
For a deeper understanding, consider using .NET source stepping to step into the code and see how the mathematics is carried out. But the basic point is that BigInteger uses an internal representation composed of an array of 32-bit values, which is transformed into a readable format, and that is what allows numbers bigger than even Int64.
// For values int.MinValue < n <= int.MaxValue, the value is stored in sign
// and _bits is null. For all other values, sign is +1 or -1 and the bits are in _bits
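As a rough illustration of that 'decompose' idea (a sketch only, not the actual BigInteger code), a 64-bit value can be split into little-endian 32-bit digits in the same spirit as the _bits array:

// Illustrative only: split a ulong into little-endian 32-bit digits.
static uint[] ToLimbs(ulong value)
{
    uint low = (uint)(value & 0xFFFFFFFF);
    uint high = (uint)(value >> 32);
    return high == 0 ? new[] { low } : new[] { low, high };
}

// ToLimbs(ulong.MaxValue) returns { 0xFFFFFFFF, 0xFFFFFFFF }.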
Is there any convention for the algorithm used to lay out structs in C?
I want code running in a VM to be able to use structures compatible with their C counterparts, just like C# interop does. For this I need to know how the alignment algorithm works. I gather there must be a convention for it, since it works nicely in C#. I have in mind the algorithm they probably used, but I haven't found any proof that it is the right one.
Here's how I think it works:
for each declared field (by order of declaration)
See if the field fits in the remaining bytes (before the next alignment boundary)
If it doesn't fit, align this field; otherwise add it to current offset
For example, on a 32-bit system, a struct like:

{
    byte b1;
    byte b2;
    int32 i1;
    byte b3;
}
would be laid out like this by this algorithm:

{
    byte b1;
    byte b2;
    byte[2] align1;
    int32 i1;
    byte b3;
    byte[3] align2;
}
In general, structure alignment in C depends on the compiler used, and especially the compiler options in effect at the time the structure declaration is processed. You can't make any general assumptions except to say that for a particular structure in a particular program compiled with particular settings, the structure layout can be determined.
That said, your guess closely matches what most compilers are likely to do with default alignment settings.
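Since the question mentions C# interop as the model, here is a minimal sketch of how one might check a layout from the managed side (assuming the C compiler's default packing matches the platform's, which is typical but not guaranteed):

using System;
using System.Runtime.InteropServices;

// Mirrors the example struct from the question, using C#'s byte and int (1 and 4 bytes).
[StructLayout(LayoutKind.Sequential)]
struct Example
{
    public byte b1;
    public byte b2;
    public int i1;
    public byte b3;
}

class LayoutCheck
{
    static void Main()
    {
        // With default packing this typically prints offsets 0, 1, 4, 8 and size 12,
        // matching the padded layout shown in the question.
        Console.WriteLine(Marshal.OffsetOf(typeof(Example), "b1"));
        Console.WriteLine(Marshal.OffsetOf(typeof(Example), "b2"));
        Console.WriteLine(Marshal.OffsetOf(typeof(Example), "i1"));
        Console.WriteLine(Marshal.OffsetOf(typeof(Example), "b3"));
        Console.WriteLine(Marshal.SizeOf(typeof(Example)));
    }
}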
First off, no, I am not a student... just a C# guy porting a C++ library.
What do these two crazy lines mean? What are they equivalent to in C#? I'm mostly concerned with the size_t and sizeof. I'm not concerned about static_cast or assert... I know how to deal with those.
size_t Index = static_cast<size_t>((y - 1620) / 2);
assert(Index < sizeof(DeltaTTable)/sizeof(double));
y is a double and DeltaTTable is a double[]. Thanks in advance!
size_t is a typedef for an unsigned integer type. It is used for sizes of things, and may be 32 or 64 bits in size. The particular size of a size_t is implementation defined, but it is unsigned.
I suppose in C# you could use a 64-bit unsigned integer type.
All sizeof does is return the size in bytes of a C++ type. Every type takes up a certain quantity of room, and sizeof returns that size.
What your code is doing is computing the number of doubles (64-bit floats) in DeltaTTable and asserting that the index computed from y falls within the bounds of the table.
There is no direct equivalent of this sizeof idiom in C#, nor do you need one: a C# array knows its own length (DeltaTTable.Length), and the runtime checks bounds for you, so there is no reason to port this code to C# as-is.
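If you do want to keep the check, a minimal sketch of a port (assuming DeltaTTable is a double[] field) might be:

using System.Diagnostics;

class DeltaTExample
{
    // Hypothetical table; the real values come from the C++ library being ported.
    static readonly double[] DeltaTTable = new double[200];

    static double LookUp(double y)
    {
        // Same computation as: size_t Index = static_cast<size_t>((y - 1620) / 2);
        int index = (int)((y - 1620) / 2);

        // The C++ assert compared Index against sizeof(DeltaTTable)/sizeof(double);
        // in C# the array knows its own element count via Length.
        Debug.Assert(index >= 0 && index < DeltaTTable.Length);

        return DeltaTTable[index];
    }
}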
The bad news first: you can't do that in C#. There's no static cast, only dynamic casts. However, the good news is that it doesn't matter.
The two lines of code are asserting that the index is within the bounds of the table, so that the code won't accidentally read some arbitrary memory location. The CLR takes care of that for you, so when porting, just ignore those lines; the checks are there for you automatically anyway.
Of course, this is based on an assumption drawn from the pattern of the code; there's no information on what y represents or how Index is used.
sizeof calculates how much memory, in bytes, the DeltaTTable object takes.
There is no equivalent way to calculate the size like this in C#, AFAIK.
I guess size_t must be a struct type in the C++ code.
How can I convert a double value to a binary value?
I have a value like the one below: 125252525235558554452221545332224587265. I want to convert this to binary format (1's and 0's), so I am keeping it in a double and then trying to convert it to binary. I am using C#.NET.
Well, you haven't specified a platform or what sort of binary value you're interested in, but in .NET there's BitConverter.DoubleToInt64Bits which lets you get at the IEEE 754 bits making up the value very easily.
In Java there's Double.doubleToLongBits which does the same thing.
Note that if you have a value such as "125252525235558554452221545332224587265" then you've got more information than a double can store accurately in the first place.
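A minimal sketch of using BitConverter.DoubleToInt64Bits (the exact bit pattern assumes IEEE 754 doubles):

using System;

class Bits
{
    static void Main()
    {
        long bits = BitConverter.DoubleToInt64Bits(123.5);
        Console.WriteLine(bits.ToString("X16")); // 405EE00000000000
    }
}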
In C, you can do it for instance this way, which is a classic use of the union construct:
int i;
union {
    double x;
    unsigned char byte[sizeof (double)];
} converter;

converter.x = 5.5555555555556e18;

for (i = 0; i < sizeof converter.byte; i++)
    printf("%02x ", converter.byte[i]);
If you stick this in a main() and run it, it might print something like this:
~/src> gcc -o floatbits floatbits.c
~/src> ./floatbits
ba b5 f6 15 53 46 d3 43
Note though that this, of course, is platform-dependent in its endianness. The above is from a Linux system running on a Sempron CPU, i.e. it's little endian.
A decade late but hopefully this will help someone:
// Converts a double value to a string in base 2 for display.
// Example: 123.5 --> "0:10000000101:1110111000000000000000000000000000000000000000000000"
// Created by Ryan S. White in 2020, released under the MIT license.
string DoubleToBinaryString(double val)
{
    long v = BitConverter.DoubleToInt64Bits(val);
    string binary = Convert.ToString(v, 2);
    return binary.PadLeft(64, '0').Insert(12, ":").Insert(1, ":");
}
If you mean you want to do it yourself, then this is not a programming question.
If you want to make a computer do it, the easiest way is to use a floating point input routine and then display the result in its hex form. In C++:
double f = atof("5.5555555555556E18");
unsigned char *b = (unsigned char *) &f;

for (int j = 0; j < 8; ++j)
    printf(" %02x", b[j]);
A double value already IS a binary value; it is just a matter of the representation you wish it to have. In a programming language, when you call it a double, the language you use will interpret it one way; if you happen to call the same chunk of memory an int, it is not the same number.
So it depends what you really want... If you need to write it to disk or send it over the network, then you need to think about big-endian vs. little-endian byte order.
For these huge numbers (which cannot be represented accurately using a double) you need some specialized class to hold the information.
C# provides the Decimal type:
The Decimal value type represents decimal numbers ranging from positive 79,228,162,514,264,337,593,543,950,335 to negative 79,228,162,514,264,337,593,543,950,335. The Decimal value type is appropriate for financial calculations requiring large numbers of significant integral and fractional digits and no round-off errors. The Decimal type does not eliminate the need for rounding. Rather, it minimizes errors due to rounding. For example, the following code produces a result of 0.9999999999999999999999999999 rather than 1.
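The code referred to in that quote is not included above, but a minimal sketch of the kind of computation it describes (dividing one by three and multiplying back) would be:

using System;

class DecimalRounding
{
    static void Main()
    {
        decimal dividend = Decimal.One;
        decimal divisor = 3;
        // Prints 0.9999999999999999999999999999 rather than 1, because 1/3
        // cannot be represented exactly even with 28-29 significant decimal digits.
        Console.WriteLine(dividend / divisor * divisor);
    }
}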
If you need bigger precision than this, you need to make your own class, I guess. There is one for ints here: http://sourceforge.net/projects/cpp-bigint/ although it seems to be for C++.