The binary storage of floating point numbers

Floating point numbers are positive or negative numbers with a decimal fraction: numbers such as 1.0, 17.11 or –3.12 are all floating point numbers.
On the Arduino and other microprocessors they are stored in ‘floats’.
Whereas the concept of a byte or an integer is quite straightforward, a float in binary form is a bit more challenging.
A byte is simple: each of its 8 bits switches one power of 2 on or off, so its value is a sum taken from 2⁷+2⁶+2⁵+2⁴+2³+2²+2¹+2⁰.

An integer is similar, except that it extends to 2¹⁵.
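
To make the place values concrete, here is a small Python sketch (meant to be run on a PC, not on the Arduino). The function name `place_values` is made up for this illustration; it simply lists the powers of 2 that are switched on in a given value.

```python
def place_values(value, bits=8):
    """Return the powers of 2 that add up to value, highest first."""
    return [2 ** n for n in range(bits - 1, -1, -1) if value & (1 << n)]

print(place_values(200))       # [128, 64, 8]
print(sum(place_values(200)))  # 200
```

Calling it with `bits=16` gives the same breakdown for a 16-bit value.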

But how do you store a number like –1.5 ?

Well, floats are stored on the principle that practically every number (zero being the obvious exception) can be expressed as a power of 2 multiplied by a number between 1 and 2. Take for instance 7, or 7.00 for that matter: it can be expressed as 4 * 1.75 (or 2² * 1.75).

The number –1.5 is in fact 1 * 1.5, which is 2⁰ * 1.5, preceded by a ‘–’ sign.
The same goes for higher numbers, say 20.5 = 16 * 1.28125 (= 2⁴ * 1.28125).
Sure, 20.5 can also be expressed as 2 * 10.25, but the last number must be between 1 and 2.

So in fact every number can be represented by:
sign * 2^x * y  (y being a number between 1 and 2).

By convention we call ‘x’ the ‘exponent’ and ‘y’ the ‘mantissa’, though the word mantissa is also used for the fractional part of a logarithm. The IEEE standard for floating point numbers therefore avoids the word ‘mantissa’: it speaks of the ‘significand’, and uses ‘fraction’ for the part that is actually stored. Here we will use ‘fraction’, so we can write the above as:
floating point number = sign * 2^exponent * fraction
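
As a rough check of this notation, the following Python sketch splits a number into its three parts. The function name `decompose` is invented for this example; it leans on the standard `math.frexp`, which returns a value between 0.5 and 1, so the result is shifted by one power of 2 to land between 1 and 2.

```python
import math

def decompose(value):
    """Split value into (sign, exponent, fraction) with 1 <= fraction < 2."""
    if value == 0:
        raise ValueError("zero cannot be written in this form")
    sign = -1 if value < 0 else 1
    m, e = math.frexp(abs(value))  # abs(value) == m * 2**e, with 0.5 <= m < 1
    return sign, e - 1, m * 2      # shift so that 1 <= fraction < 2

print(decompose(7))     # (1, 2, 1.75)
print(decompose(-1.5))  # (-1, 0, 1.5)
print(decompose(20.5))  # (1, 4, 1.28125)
```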

Let's get back to the fraction part of the number 20.5, which is 1.28125 (remember? 20.5 = 2⁴ * 1.28125). If we look at that a bit deeper, we can see that it is actually 1 + 1/4 + 1/32. That makes sense because 16 * (1 + 1/4 + 1/32) = 16 + 4 + 0.5 = 20.5.
If we break this down further, we can see that the ‘fraction’ or mantissa is actually a sum of fractions that are each 1/(a power of 2). It is probably clear by now that for 20.5 that would be 1/2⁰ + 1/2² + 1/2⁵.
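
The breakdown into 1/(a power of 2) terms can be found mechanically. The sketch below is one way to do it in Python; `fraction_terms` is a hypothetical helper name, and it assumes the input is already a number between 1 and 2. It stops after 23 terms, and anything smaller than 1/2²³ is simply dropped.

```python
def fraction_terms(fraction, max_power=23):
    """Return the k values for which 1/2**k is part of the fraction."""
    terms = []
    remainder = fraction
    for k in range(max_power + 1):
        part = 1 / 2 ** k
        if remainder >= part:   # this term fits, so take it
            terms.append(k)
            remainder -= part
    return terms

print(fraction_terms(1.28125))  # [0, 2, 5]
print(fraction_terms(1.75))     # [0, 1, 2]
```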

Anyway, back to the binary storage.
As said, on the Arduino and many other processors, a floating point number is stored in 32 bits, and the layout follows from the notation we have learned above.
The leftmost bit, bit 32, stores the ‘sign’: if it is a ‘1’ the number is negative, if it is a ‘0’ it is positive.
The next 8 bits, bits 31-24, store the exponent. As we want values between 2¹²⁸ and 2⁻¹²⁷, we store 2¹ as 10000000 (decimal 128), 2² as 10000001 (decimal 129), 2³ as 10000010 (decimal 130) etc. The exponent thus follows from subtracting 127 from the decimal number that is stored in bits 31-24. (The two extreme patterns, 00000000 and 11111111, are reserved for special values such as zero and infinity.)
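
The offset of 127 is easy to try out. In this Python sketch the names `BIAS`, `stored_exponent` and `real_exponent` are just labels chosen for the example; the arithmetic is the add-127 / subtract-127 rule described above.

```python
BIAS = 127

def stored_exponent(exponent):
    """Return the 8-bit pattern stored for a given power of 2."""
    stored = exponent + BIAS
    if not 0 <= stored <= 255:
        raise ValueError("exponent does not fit in 8 bits")
    return format(stored, "08b")

def real_exponent(bits):
    """Return the power of 2 that an 8-bit exponent pattern stands for."""
    return int(bits, 2) - BIAS

print(stored_exponent(1))         # 10000000
print(stored_exponent(4))         # 10000011
print(real_exponent("10000001"))  # 2
```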

The fraction or mantissa is stored in bits 23-1. However, since we know that the fraction is always between 1 and 2, we do not store the ‘1’, as we know it is always there. We refer to that as the ‘hidden’ bit, although it is not really hidden, it is just not stored. We use bits 23-1 to indicate a sum of the fractions 1/2, 1/4, 1/8, 1/16 etc.
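
Here is a minimal Python sketch of how those 23 bits could be filled in by hand. The helper name `fraction_bits` is hypothetical, the input is assumed to lie between 1 and 2, and leftover amounts smaller than 1/2²³ are cut off rather than rounded, which is a simplification.

```python
def fraction_bits(fraction, width=23):
    """Return the stored bits for a fraction with 1 <= fraction < 2."""
    remainder = fraction - 1  # drop the hidden bit, it is not stored
    bits = ""
    for k in range(1, width + 1):
        part = 1 / 2 ** k     # 1/2, 1/4, 1/8, ...
        if remainder >= part:
            bits += "1"
            remainder -= part
        else:
            bits += "0"
    return bits

print(fraction_bits(1.28125))  # 01001000000000000000000
print(fraction_bits(1.75))     # 11000000000000000000000
```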
So, the binary storage of a float is as follows:

         sign   exponent        hidden   fraction
20.5     0      10000011                 01001000000000000000000
         +      4 (131-127)     1        + 1/4 + 1/32
-7       1      10000001                 11000000000000000000000
         -      2 (129-127)     1        + 1/2 + 1/4
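
The two rows above can be checked on a PC with Python's standard `struct` module, which can pack a number as a 32-bit float. This is only a sketch; the function name `float_fields` is made up here, and it splits the 32 bits into the sign, exponent and fraction fields.

```python
import struct

def float_fields(value):
    """Return the (sign, exponent, fraction) bit strings of a 32-bit float."""
    # Pack as a big-endian 32-bit float, then read the same bytes as an integer.
    raw = struct.unpack(">I", struct.pack(">f", value))[0]
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]

print(float_fields(20.5))  # ('0', '10000011', '01001000000000000000000')
print(float_fields(-7.0))  # ('1', '10000001', '11000000000000000000000')
```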

