In the tutoring session, I was asked to swap the exponent part of a (32-bit) floating point number to be a new value without any arithmetic operations. That is, let be a floating number in the binary format like the following:
; we want to change the exponent to be some new value . The result of this swapping is
In the IEEE standard, a floating number is represented as
We can use a bit mask to extract the exponent part of :
cppcppunion float_bits { float f; unsigned u; }; float swap_exp(float f, int8_t exp) { union float_bits x; x.f = f; uint32_t mask = 0x7f800000; uint32_t old_exp = x.u & mask; // more stuff needs to be done here // so that we can swap the old_exp // to be the exp return x.f; }
union float_bits { float f; unsigned u; }; float swap_exp(float f, int8_t exp) { union float_bits x; x.f = f; uint32_t mask = 0x7f800000; uint32_t old_exp = x.u & mask; // more stuff needs to be done here // so that we can swap the old_exp // to be the exp return x.f; }
Recall that the exponent of the binary representation of a floating number needs to be offset by a bias number before it can be the real exponent part in the floating point bit representation. In this case, the bias number is .
We can simply take the previous code and add the bias number to the new exponent to get the new exponent part of the floating number.
cppcppunion float_bits { float f; unsigned u; }; float swap_exp(float f, int8_t exp) { union float_bits x; x.f = f; uint32_t mask = 0x7f800000; uint32_t old_exp = x.u & mask; uint32_t new_exp = exp + 127; x.u = (x.u & ~old_exp) | (new_exp << 23); return x.f; }
union float_bits { float f; unsigned u; }; float swap_exp(float f, int8_t exp) { union float_bits x; x.f = f; uint32_t mask = 0x7f800000; uint32_t old_exp = x.u & mask; uint32_t new_exp = exp + 127; x.u = (x.u & ~old_exp) | (new_exp << 23); return x.f; }
But NO ARITHMETIC OPERATIONS are allowed. So we can’t just add the exp
with the bias number. We need something trickier here. Ryan and I thought of a solution that uses only bit operations.
For any positive added with the biased number is equal to . That is,
Therefore, if we know what is, we can |
it with the mask 0x1000_0000
.
So how do we compute without arithmetic operations? Well,
cppcppuint32_t sub_one(uint32_t n) { uint32_t i; for (i = 1; i != (i & n) ; n |= i, i <<= 1) ; return n & ~i; }
uint32_t sub_one(uint32_t n) { uint32_t i; for (i = 1; i != (i & n) ; n |= i, i <<= 1) ; return n & ~i; }
When we subtract one from a binary, if the th place is , we need to burrow from higher places and the th place with turn to . For example, if we want to subtract one from , we need to borrow from the $1$st place and the result will be . This means, when we subtract one from a binary number, we substitute all zeros we see to be before we see any ; then we substitute the first to be . The for
loop above does exactly this. It stops when it finds the first but before that it substitutes any in n
to be . After the for loop, we set the first we saw to be . Then we subtracted !
For any negative , we can get the real exponent bits by
So the final code is
cppcppunion float_bits { float f; unsigned u; }; uint32_t sub_one(uint32_t n) { uint32_t i; for (i = 1; i != (i & n) ; n |= i, i <<= 1) ; return n & ~i; } float swap_exp(float f, int8_t exp) { union float_bits x; x.f = f; uint32_t mask = 0x7f800000; uint32_t old_exp = x.u & mask; uint32_t new_exp = exp == 0 ? 127 : ( exp > 0 ? (0b10000000 | sub_one(exp)) : sub_one(0b01111111 & exp) ); x.u = (x.u & ~old_exp) | (new_exp << 23); return x.f; } int main() { float f = 16.0; // 1.0 * 2^4 printf( "The output is %.5f and the expected output is 32.0\n", swap_exp(f, 5) ); printf( "The output is %.5f and the expected output is 0.5\n", swap_exp(f, -1) ); printf( "The output is %.5f and the expected output is 1.0\n", swap_exp(f, 0) ); return 0; }
union float_bits { float f; unsigned u; }; uint32_t sub_one(uint32_t n) { uint32_t i; for (i = 1; i != (i & n) ; n |= i, i <<= 1) ; return n & ~i; } float swap_exp(float f, int8_t exp) { union float_bits x; x.f = f; uint32_t mask = 0x7f800000; uint32_t old_exp = x.u & mask; uint32_t new_exp = exp == 0 ? 127 : ( exp > 0 ? (0b10000000 | sub_one(exp)) : sub_one(0b01111111 & exp) ); x.u = (x.u & ~old_exp) | (new_exp << 23); return x.f; } int main() { float f = 16.0; // 1.0 * 2^4 printf( "The output is %.5f and the expected output is 32.0\n", swap_exp(f, 5) ); printf( "The output is %.5f and the expected output is 0.5\n", swap_exp(f, -1) ); printf( "The output is %.5f and the expected output is 1.0\n", swap_exp(f, 0) ); return 0; }
There are inf
and nan
but we are not going to deal with them here.
The code format is not working very well. I will fix it later! :)