Storing two float values in a single float variable - c#

I'd like to store two float values in a single 32 bit float variable. The encoding will happen in C# while the decoding is to be done in a HLSL shader.
The best solution I've found so far is hard-wiring the offset of the decimal in the encoded values and storing them as integer and decimal of the "carrier" float:
123.456 -> 12.3 and 45.6
It can't handle negative values but that's ok.
However I was wondering if there is a better way to do this.
EDIT: A few more details about the task:
I'm working with a fixed data structure in Unity where the vertex data is stored as floats. (Float2 for a UV, float3 for the normal, and so on.) Apparently there is no way to properly add extra data, so I have to work within these limits; that's why I figured it all came down to a more general issue of encoding data. For example, I could sacrifice the secondary UV data to transfer the 2x2 extra data channels.
The target is shader model 3.0 but I wouldn't mind if the decoding was working reasonably on SM2.0 too.
Data loss is fine as long as it's "reasonable". The expected value range is 0..64 but as I come to think of it 0..1 would be fine too since that is cheap to remap to any range inside the shader. The important thing is to keep precision as high as possible. Negative values are not important.

Following Gnietschow's recommendation I adapted the algo of YellPika. (It's C# for Unity 3d.)
float Pack(Vector2 input, int precision)
{
    Vector2 output = input;
    // Quantize each component to an integer in 0..precision-1.
    output.x = Mathf.Floor(output.x * (precision - 1));
    output.y = Mathf.Floor(output.y * (precision - 1));
    // x goes into the high "digit", y into the low one.
    return (output.x * precision) + output.y;
}
Vector2 Unpack(float input, int precision)
{
    Vector2 output = Vector2.zero;
    output.y = input % precision;
    output.x = Mathf.Floor(input / precision);
    return output / (precision - 1);
}
The quick and dirty testing produced the following stats (1 million random value pairs in the 0..1 range):
Precision: 2048 | Avg error: 0.00024424 | Max error: 0.00048852
Precision: 4096 | Avg error: 0.00012208 | Max error: 0.00024417
Precision: 8192 | Avg error: 0.00011035 | Max error: 0.99999940
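Precision of 4096 seems to be the sweet spot: the largest packed value is 4095 * 4096 + 4095 = 2^24 - 1, and a 32-bit float can represent every integer up to 2^24 exactly, whereas 8192 needs 26 bits and loses the low bits. Note that both packing and unpacking in these tests ran on the CPU, so the results could be worse on a GPU if it cuts corners with float precision.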
Anyway, I don't know if this is the best algorithm but it seems good enough for my case.
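For checking the round trip outside Unity, here is a minimal sketch of the same packing scheme using `System.MathF` instead of `Mathf`. The class name `FloatPacking` and the tuple return type are illustrative assumptions, not part of the original answer, and `MathF` assumes a reasonably recent .NET runtime.

```csharp
using System;

// Hypothetical helper for illustration; mirrors the Pack/Unpack pair above.
public static class FloatPacking
{
    // Packs two values in [0, 1] into one float.
    // With precision <= 4096 the packed integer stays below 2^24,
    // so a 32-bit float holds it exactly.
    public static float Pack(float x, float y, int precision)
    {
        float qx = MathF.Floor(x * (precision - 1));
        float qy = MathF.Floor(y * (precision - 1));
        return qx * precision + qy;
    }

    // Recovers the two quantized values, rescaled to [0, 1].
    public static (float X, float Y) Unpack(float packed, int precision)
    {
        float qy = packed % precision;
        float qx = MathF.Floor(packed / precision);
        return (qx / (precision - 1), qy / (precision - 1));
    }
}
```

Because the values are floored during packing, each recovered component should differ from its input by less than 1 / (precision - 1).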


Change an integral value's data type while keeping it normalized to the maximum value of said data type in C#

I want to change a value, of, let's say, type int to be of type short, and making the value itself be "normalized" to the maximum value short can store - that is, so int.MaxValue would convert into short.MaxValue, and vice versa.
Here's an example using floating-point math to demonstrate:
public static short Rescale(int value)
{
    float normalized = (float)value / int.MaxValue; // normalize the value to -1.0 to 1.0
    float rescaled = normalized * (float)(short.MaxValue);
    return (short)(rescaled);
}
While this works, it seems like using floating-point math is really inefficient and can be improved, as we're dealing with binary data here. I tried using bit-shifting, but to no avail.
Both signed and unsigned values are going to be processed - that isn't really an issue with the floating point solution, but when bit-shifting and doing other bit-manipulation, that makes things much more difficult.
This code will be used in quite a performance heavy context - it will be called 512 times every ~20 milliseconds, so performance is pretty important here.
How can I do this with bit-manipulation (or plain old integer algebra, if bit manipulation isn't necessary) and avoid floating-point math when we're operating on integer values?
You should use the shift operator. It is very fast.
int is 32 bits, short is 16, so shift 16 bits right to scale your int to a short:
int x = 208908324;
// 32 bits vs 16 bits.
short k = (short)(x >> 16);
Just reverse the process for scaling up. Obviously the lower bits will be filled with zeros.
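The shift in both directions, for signed and unsigned types, could be sketched as follows. The class and method names are made up for illustration. Note the assumption baked into this approach: scaling up fills the low bits with zeros, so `short.MaxValue` maps to `0x7FFF0000` rather than to `int.MaxValue`.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class IntScaling
{
    // int -> short: keep the top 16 bits (arithmetic shift preserves the sign).
    public static short ToShort(int value) => (short)(value >> 16);

    // short -> int: move the value back into the top 16 bits.
    public static int ToInt(short value) => value << 16;

    // uint -> ushort: logical shift, no sign to worry about.
    public static ushort ToUShort(uint value) => (ushort)(value >> 16);

    // ushort -> uint: widen first, then shift.
    public static uint ToUInt(ushort value) => (uint)value << 16;
}
```

With the value from the answer, `IntScaling.ToShort(208908324)` gives 3187, and scaling that back up gives 208863232; the difference is the discarded low 16 bits.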

Is there a way to compress Y rotation axis to one byte?

I am making a Unity Multiplayer game, and I wanted to compress the Y rotation axis from sending the whole Quaternion to just sending one byte.
My first compression attempt:
Instead of sending Quaternion, I have just sent a Y-axis float value
Result: 16 bytes -> 4 bytes (12 bytes saved overall)
Second compression attempt:
I have cached lastSentAxis variable (float) which contains the last Y-axis value that has been sent to the server
When a player changes their rotation (looks right/left), then a new Y-axis is compared to the cached one, and a delta value is prepared (delta is guaranteed to be less than 255).
Then, I create a new sbyte - which contains rotation way (-1, if turned left, 1, if turned right)
Result: 4 bytes -> 2 bytes (2 bytes saved, 14 overall)
Third compression attempt (failed)
Define a byte flag instead of creating a separated byte mentioned before (1 - left, 2 - right)
Get a delta rotation value (as mentioned previously), but add it to the byte flag
PROBLEM: I have looped through 0 to 255 to find which numbers will collide with the byte flag.
POTENTIAL SOLUTION: Check if flag + delta is in the colliding number list. If yes, don't send a rotation request.
Every X requests, send a correction float value
Potential result: 2 bytes -> 1 byte (1 byte saved, 15 overall)
My question is: is it possible to make the third compression attempt work in a more... proper way, or is my potential solution the only thing I can achieve?
I would not claim that you saved overall 15 bytes ^^
If you only need one component of the rotation anyway then the first step of syncing a single float (4 bytes) seems actually pretty obvious ;)
I would also say that going beyond that sounds a bit like an unnecessary micro optimization.
The delta sync is quite clever and at first glance halves the payload from 4 bytes to 2 bytes.
However, it is also quite error prone and could desync if even a single transmission fails.
It also lowers the precision to 1-degree integer steps instead of a full float value.
Honestly I would stick to the 4 bytes just for stability and precision.
2 bytes - about 0.0055° precision
With 2 bytes you can actually do far better than your attempt!
Why waste an entire byte just for the sign of the value?
Use a short: it uses a single bit for the sign and still has 15 bits left for the value!
You just would have to map your floating point range of -180 to 180 to the range -32768 to 32767.
// Sender: your delta between -180 and 180
float actualAngleDelta;
var shortAngleDelta = (short)Mathf.RoundToInt(actualAngleDelta / 180f * short.MaxValue);
var sendBytes = BitConverter.GetBytes(shortAngleDelta);
// Receiver: undo the mapping with the same 180 factor
short shortAngleDelta = BitConverter.ToInt16(receivedBytes, 0);
float actualAngleDelta = (float)shortAngleDelta / (float)short.MaxValue * 180f;
But honestly then you should rather not sync the delta but the actual value.
So, use a ushort!
It covers values from 0 to 65535 so just map the possible 360 degrees on that. Sure you lose a little bit on precision but not down to full degrees ;)
// Sender: a value between 0 and 360
float actualAngle;
ushort ushortAngle = (ushort) Mathf.RoundToInt((actualAngle % 360f) / 360f * ushort.MaxValue);
byte[] sendBytes = BitConverter.GetBytes(ushortAngle);
// Receiver
ushort ushortAngle = BitConverter.ToUInt16(receivedBytes, 0);
float actualAngle = (float)ushortAngle / (float)ushort.MaxValue * 360f;
Both maintain a precision down to about 0.0055 (= 360/65535) degrees!
Single byte - about 1.41° precision
If a lower precision is an option for you anyway you could however go totally fancy and say you don't sync every exact rotation angle in degrees but rather divide a circle not by 360 but by 256 steps.
Then you could map the delta to your lesser grained "degree" angles and could cover the entire circle in a single byte:
// Sender
byte sendByte = (byte)Mathf.RoundToInt((actualAngle % 360f) / 360f * (float)byte.MaxValue);
// Receiver
float actualAngle = receivedByte / (float)byte.MaxValue * 360f;
which would have a precision of about 1.4 degrees.
BUT honestly, are all these back-and-forth calculations really worth the 2 or 3 saved bytes?
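As a variation on the single-byte idea, here is a sketch that divides the circle into exactly 256 steps rather than scaling by `byte.MaxValue`, so that 0° and 360° share the code 0 and negative angles wrap around. This is an alternative mapping rather than the one in the answer above; the class and method names are hypothetical, and `System.MathF` stands in for Unity's `Mathf`.

```csharp
using System;

// Hypothetical codec for illustration: 256 steps of 1.40625 degrees each.
public static class AngleByteCodec
{
    private const float StepDegrees = 360f / 256f;

    // Encodes any angle in degrees (negative or > 360 allowed) into one byte.
    public static byte Encode(float degrees)
    {
        float wrapped = degrees % 360f;
        if (wrapped < 0f) wrapped += 360f;      // C# % keeps the sign of the dividend
        int step = (int)MathF.Round(wrapped / StepDegrees);
        return (byte)(step & 0xFF);             // step 256 is the same direction as step 0
    }

    // Decodes the byte back to an angle in [0, 360).
    public static float Decode(byte code) => code * StepDegrees;
}
```

For example, `AngleByteCodec.Encode(180f)` yields 128 and `AngleByteCodec.Decode(128)` yields 180 again; the worst-case rounding error of this mapping is half a step, roughly 0.7 degrees.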

How can I multiply and divide integers without bigger intermediate types?

Currently, I'm developing some fuzzy logic stuff in C# and want to achieve this in a generic way. For simplicity, I can use float, double and decimal to process an interval [0, 1], but for performance, it would be better to use integers. Some thoughts about symmetry also led to the decision to omit the highest value in unsigned and the lowest value in signed integers. The lowest, non-omitted value maps to 0 and the highest, non-omitted value maps to 1. The omitted value is normalized to the next non-omitted value.
Now, I want to implement some compound calculations in the form of:
byte f(byte p1, byte p2, byte p3, byte p4)
{
    return (p1 * p2) / (p3 * p4);
}
where the byte values are interpreted as the [0, 1] interval mentioned above. This means p1 * p2 < p1 and p1 * p2 < p2 as opposed to numbers greater than 1, where this is not valid, e. g. 2 * 3 = 6, but 0.1 * 0.2 = 0.02.
Additionally, a problem is: p1 * p2 and p3 * p4 may exceed the range of the type byte. The result of the whole formula may not exceed this range, but the overflow would still occur in one or both parts. Of course, I can just cast to ushort and in the end back to byte, but for an ulong I wouldn't have this possibility without further effort and I don't want to stick to 32 bits. On the other hand, if I return (p1 / p3) * (p2 / p4), I decrease the type escalation, but might run into a result of 0, where the actual result is non-zero.
So I thought of somehow simultaneously "shrinking" both products step by step until I have the result in the [0, 1] interpretation. I don't need an exact value, a heuristic with an error less than 3 integer values off the correct value would be sufficient, and for an ulong an even higher error would certainly be OK.
So far, I have tried to convert the input to a decimal/float/double in the interval [0, 1] and calculated it. But this is completely counterproductive regarding performance. I read stuff about division algorithms, but I couldn't find the one I saw once in class. It was about calculating quotient and remainder simultaneously, with an accumulator. I tried to reconstruct and extend it for factorized parts of the division with corrections, but this breaks where indivisibility occurs, and I get too big an error. I also made some notes and calculated some integer examples manually, trying to factor out, cancel out, split sums and such fancy derivation stuff, but nothing led to a satisfying result or steps for an algorithm.
Is there a performant way to multiply/divide signed (and unsigned) integers as above, interpreted as the interval [0, 1], without type promotion?
To answer your question as summarised: No.
You need to state (and rank) your overall goals explicitly (e.g., is symmetry more or less important than performance?). Your chances of getting a helpful answer improve with succinctly stating them in the question.
While I think Phil1970's "you can ignore scaling for … division" is overly optimistic, multiplication is enough of a problem: if you don't generate partial results bigger (twice as big) than your "base type", you are stuck with multiplying parts of your operands and piecing the result together.
For ideas about piecing together "larger" results: AVR's Fractional Multiply.
Regarding "…in signed integers. The lowest, non-omitted value maps to 0…", I expect that you will find, e.g., excess -32767/32768-coded fractions even harder to handle than two's complement ones.
If you are not careful, you will lose more time doing conversions than it would have taken with regular operations.
That being said, an alternative that might make some sense would be to map values to the range 0 to 128 inclusive (or 0 to 32768 if you want more precision), so that all values are essentially stored multiplied by 128.
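So if you have (0.5 * 0.75) / (0.125 * 0.25), the stored values for each of those numbers would be 64, 96, 16 and 32 respectively. If you do those computations using ushort you would have (64 * 96) / (16 * 32) = 6144 / 512 = 12. Read as a stored value, this would give a result of 12 / 128 = 0.09375; note, however, that the factors of 128 cancel between numerator and denominator, so 12 is also the true quotient 0.375 / 0.03125.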
By the way, you can ignore scaling for addition, subtraction and division. For multiplication, you would do the multiplication as usual and then divide by 128. So for 0.5 * 0.75 you would have 64 * 96 / 128 = 48, which corresponds to 48 / 128 = 0.375 as expected.
The code can be optimized for the platform, particularly if the platform is more efficient with narrow numbers. And if necessary, rounding could be added to the operations.
By the way, since the scaling is a power of 2, you can use bit shifting for scaling. You might prefer to use 256 instead of 128, particularly if you don't have one-cycle bit shifting, but then you need a larger width to handle some operations.
But you might be able to do some optimization if the most significant bit is not set for example so that you would only use larger width when necessary.
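The scale-by-128 idea can be sketched as a tiny fixed-point helper. The names `Fixed128`, `Mul` and `Div` are hypothetical. Two assumptions are worth stating loudly: C# promotes `ushort` operands to `int` for arithmetic, so the intermediates here are wider than the storage type, and in this sketch division pre-shifts the dividend, because dividing one scaled value by another would otherwise cancel the scale and return an unscaled quotient.

```csharp
using System;

// Hypothetical fixed-point helper: values in [0, 1] are stored as 0..128.
public static class Fixed128
{
    public const int Shift = 7;              // 2^7 = 128
    public const ushort One = 1 << Shift;    // stored form of 1.0

    public static ushort FromDouble(double v) => (ushort)Math.Round(v * One);
    public static double ToDouble(ushort v) => v / (double)One;

    // (a/128) * (b/128) = (a*b/128) / 128, so shift the product right once.
    public static ushort Mul(ushort a, ushort b) => (ushort)((a * b) >> Shift);

    // (a/128) / (b/128) = a/b, so shift the dividend left first to keep the scale.
    // The caller must ensure b is not zero.
    public static ushort Div(ushort a, ushort b) => (ushort)((a << Shift) / b);
}
```

With the numbers from the answer, `Fixed128.Mul(64, 96)` gives 48, i.e. 0.375, and `Fixed128.Div(48, 96)` gives 64, i.e. 0.5 again. Results are truncated rather than rounded in this version.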

What is the greatest inaccuracy in a range of floats?

Given two float values (fLow and fHigh), how could you calculate the greatest or maximum stride/gap between the two successive values?
For example:
In the range 16777217f to 20000000f the answer would be 2, as values are effectively rounded to the nearest two.
Generalizing this to an arbitrary range has got me scratching my head - any suggestions?
This should be language neutral, but I'm using C# (which conforms to IEEE-754 for this, I think).
This is in C. It requires some IEEE 754 behavior, for rounding and such. For IEEE 754 64-bit binary (double), SmallestPositive is 2^(-1074), approximately 4.9406564584124654417656879286822137236505980261e-324, and DBL_EPSILON is 2^(-52), 2.220446049250313080847263336181640625e-16. For 32-bit binary (float), change DBL to FLT and double to float wherever they appear (and fabs to fabsf and fmax to fmaxf, although it should work without these changes). Then SmallestPositive is 2^(-149), approximately 1.401298464324817070923729583289916131280261941876515771757068283889791e-45, and FLT_EPSILON is 2^(-23), 1.1920928955078125e-07.
For an interval between two values, the greatest step size is of course the step size at the endpoint with larger magnitude. (If that endpoint is exactly a power of two, the step size from that point to the next does not appear in the interval itself, so that would be a special case.)
#include <float.h>
#include <math.h>

/* Return the ULP of q.

   This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
   Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
   05.12_, Faculty for Information and Communication Sciences, Hamburg
   University of Technology, November 13, 2005.
*/
double ULP(double q)
{
    // SmallestPositive is the smallest positive floating-point number.
    static const double SmallestPositive = DBL_EPSILON * DBL_MIN;

    /* Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
       something in [.75 ULP, 1.5 ULP) (even with rounding).
    */
    static const double Scale = 0.75 * DBL_EPSILON;

    q = fabs(q);

    return fmax(SmallestPositive, q - (q - q * Scale));
}
Well, machine accuracy is, as the name indicates, really something that might in general depend on the machine and even on the compiler. So, to be really sure you will typically have to write a program that actually tests what is going on.
However, I suspect that you are really looking for some handy formulas that you can use to approximate the maximum distance in a given interval. The Wikipedia article on machine epsilon gives a really nice overview over this topic and I'm mostly quoting from this source in the following.
Let s be the machine epsilon of your floating point representation (i.e., about 2^(-24) in the case of standard floats), then the maximum spacing between a normalised number x and its neighbors is 2*s*|x|. The word normalised is really crucial here and I will not even try to consider the situation for de-normalised numbers because this is where things get really nasty...
That is, in your particular case the maximum spacing h in the interval you propose is given by h = 2*s*max(|fLow|, |fHigh|).
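Since the question mentions C#, here is a hedged sketch of the "largest step sits at the larger-magnitude endpoint" observation using `MathF.BitDecrement`, which is available in .NET Core 3.0 and later. The class and method names are hypothetical, and the sketch assumes both inputs are finite. Measuring the gap just below the larger-magnitude endpoint (rather than above it) sidesteps the power-of-two special case mentioned earlier.

```csharp
using System;

// Hypothetical helper for illustration; assumes finite inputs.
public static class FloatGap
{
    // Returns the largest spacing between adjacent floats inside [low, high].
    public static float MaxGap(float low, float high)
    {
        if (low == high) return 0f;   // a single point contains no gap

        // The spacing grows with magnitude, so look at the endpoint farthest from zero.
        float m = MathF.Max(MathF.Abs(low), MathF.Abs(high));

        // Distance from m to the next float toward zero, which lies inside the interval.
        return m - MathF.BitDecrement(m);
    }
}
```

For the example range in the question, `FloatGap.MaxGap(16777217f, 20000000f)` evaluates to 2.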

C# and the mischief of floats

In testing as to why my program is not working as intended, I tried typing the calculations that seem to be failing into the immediate window.
200f * 0.005f
1.0 - correct
Math.Floor(200f * 0.005f)
0.0 - incorrect
(float)(200f * 0.005f)
Math.Floor((float)(200f * 0.005f))
0.0 - incorrect
Probably some float loss is occurring, 0.99963 ≠ 1.00127 for example.
I wouldn't mind storing less precise values, but in a non-lossy way, for example if there were a numeric type that stored values as integers do, but to only three decimal places, if it could be made performant.
I think there is probably a better way of calculating (n * 0.005f) with regard to such errors.
TY, a solution:
Math.Floor(200m * 0.005m)
Also, as I understand it, this would work if I didn't mind changing the 1/200 into 1/256:
Math.Floor(200f * 0.00390625f)
The solution I'm using. It's the closest I can get in my program and seems to work ok:
float x = ...;
UInt16 n = 200;
decimal d = 1m / n;
... = Math.Floor((decimal)x * d);
Floats represent numbers as fractions with powers of two in the denominator. That is, you can exactly represent 1/2, or 3/4, or 19/256. Since .005 is 1/200, and 200 is not a power of two, instead what you get for 0.005f is the closest fraction that has a power of two on the bottom that can fit into a 32 bit float.
Decimals represent numbers as fractions with powers of ten in the denominator. Like floats, they introduce errors when you try to represent numbers that do not fit that pattern. 1m/333m for example, will give you the closest number to 1/333 that has a power of ten as the denominator and 29 or fewer significant digits. Since 0.005 is 5/1000, and that is a power of ten, 0.005m will give you an exact representation. The price you pay is that decimals are much larger and slower than floats.
You should always always always use decimals for financial calculations, never floats.
The problem is that 0.005f is actually 0.004999999888241291046142578125... so less than 0.005. That's the closest float value to 0.005. When you multiply that by 200, you end up with something less than 1.
If you use decimal instead - all the time, not converting from float - you should be fine in this particular scenario. So:
decimal x = 0.005m;
decimal y = 200m;
decimal z = x * y;
Console.WriteLine(z == 1m); // True
However, don't assume that this means decimal has "infinite precision". It's still a floating point type with limited precision - it's just a floating decimal point type, so 0.005 is exactly representable.
If you cannot tolerate any floating point precision issues, use decimal.
Ultimately even decimal has precision issues (it allows for 28-29 significant digits). If you are working in its supported range ((-7.9 x 10^28 to 7.9 x 10^28) / (10^(0 to 28))), you are quite unlikely to be impacted by them.
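The effect described in the answers can be reproduced with a small sketch. It assumes the product is evaluated in double precision, which is one plausible explanation for what the immediate window showed; in strict single-precision arithmetic the product may round to exactly 1. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical demo for illustration.
public static class FloorDemo
{
    // Widens the float factor to double before multiplying, exposing the
    // fact that 0.005f is slightly less than 0.005.
    public static double FloorViaDouble(float factor, int n) =>
        Math.Floor((double)factor * n);

    // Uses decimal throughout, where 0.005 is exactly representable.
    public static decimal FloorViaDecimal(decimal factor, int n) =>
        Math.Floor(factor * n);
}
```

Here `FloorDemo.FloorViaDouble(0.005f, 200)` returns 0, while `FloorDemo.FloorViaDecimal(0.005m, 200)` returns 1. A power-of-two factor such as `0.00390625f` (1/256) is exact in binary, so `FloorDemo.FloorViaDouble(0.00390625f, 256)` returns 1.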

