In the tutoring session, I was asked to swap the exponent part of a (32-bit) floating point number to be a new value without any arithmetic operations. That is, let xx be a floating number in the binary format like the following:

1.×2m1.\dots\times 2^{\texttt{m}}

; we want to change the exponent mm to be some new value nn. The result of this swapping is

1.×2n1.\dots\times 2^{\texttt{n}}

In the IEEE standard, a floating number is represented as

x=_Sign________Exponent_______________________Significandx = \underbrace{\texttt{\_}}_{\texttt{Sign}}\underbrace{\texttt{\_\_\_\_\_\_\_\_}}_{\texttt{Exponent}}\underbrace{\texttt{\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_}}_{\texttt{Significand}}

We can use a bit mask to extract the exponent part of xx:

cpp
union float_bits {
    float f;
    unsigned u;
};

float swap_exp(float f, int8_t exp) {
    union float_bits x;
    x.f = f;
    uint32_t mask = 0x7f800000;
    uint32_t old_exp = x.u & mask;

    // more stuff needs to be done here
    // so that we can swap the old_exp 
    // to be the exp 

    return x.f;
}

Recall that the exponent of the binary representation of a floating number needs to be offset by a bias number before it can be the real exponent part in the floating point bit representation. In this case, the bias number is 127127.

We can simply take the previous code and add the bias number to the new exponent nn to get the new exponent part of the floating number.

cpp
union float_bits {
    float f;
    unsigned u;
};

float swap_exp(float f, int8_t exp) {
    union float_bits x;
    x.f = f;
    uint32_t mask = 0x7f800000;
    uint32_t old_exp = x.u & mask;
    uint32_t new_exp = exp + 127;

    x.u = (x.u & ~old_exp) | (new_exp << 23); 

    return x.f;
}

But NO ARITHMETIC OPERATIONS are allowed. So we can’t just add the exp with the bias number. We need something trickier here. Ryan and I thought of a solution that uses only bit operations.

For any positive nn added with the biased number 127127 is equal to 128+n1128 + n - 1. That is,

0x0111_1111+n=0x1000_0000+n1=0x1000_0000 | (n - 1) \begin{aligned} \texttt{0x0111\_1111} + n &= \texttt{0x1000\_0000} + n - 1 \\ &= \texttt{0x1000\_0000 | (n - 1) } \end{aligned}

Therefore, if we know what n1n - 1 is, we can | it with the mask 0x1000_0000.

So how do we compute n1n-1 without arithmetic operations? Well,

cpp
uint32_t sub_one(uint32_t n) {
    uint32_t i;
    for (i = 1; i != (i & n) ; n |= i, i <<= 1) ;
    return n & ~i; 
}

When we subtract one from a binary, if the 00 th place is 00, we need to burrow from higher places and the 00 th place with turn to 11. For example, if we want to subtract one from 10101010, we need to borrow from the $1$st place and the result will be 10011001. This means, when we subtract one from a binary number, we substitute all zeros we see to be 11 before we see any 11; then we substitute the first 11 to be 00. The for loop above does exactly this. It stops when it finds the first 11 but before that it substitutes any 00 in n to be 11. After the for loop, we set the first 11 we saw to be 00. Then we subtracted 11!

For any negative nn, we can get the real exponent bits by

(0x0111_1111 & n)1(\texttt{0x0111\_1111 \& n}) - 1

So the final code is

cpp
union float_bits {
    float f;
    unsigned u;
};

uint32_t sub_one(uint32_t n) {
    uint32_t i;
    for (i = 1; i != (i & n) ; n |= i, i <<= 1) ;
    return n & ~i; 
}

float swap_exp(float f, int8_t exp) {
    union float_bits x;
    x.f = f;
    uint32_t mask = 0x7f800000;
    uint32_t old_exp = x.u & mask;
    uint32_t new_exp = exp == 0 ? 127 : 
                (
                    exp > 0 ? 
                    (0b10000000 | sub_one(exp)) : 
                    sub_one(0b01111111 & exp)
                );

    x.u = (x.u & ~old_exp) | (new_exp << 23); 

    return x.f;
}

int main() {
    float f = 16.0;  // 1.0 * 2^4
    printf(
        "The output is %.5f and the expected output is 32.0\n", 
        swap_exp(f, 5)
    );
    printf(
        "The output is %.5f and the expected output is 0.5\n", 
        swap_exp(f, -1)
    );
    printf(
        "The output is %.5f and the expected output is 1.0\n", 
        swap_exp(f, 0)
    );
    return 0;
}

There are inf and nan but we are not going to deal with them here.

The code format is not working very well. I will fix it later! :)