Floating Point Representation

Table of contents
  1. Floating Point Structure
  2. Single Precision
    1. Single Precision Example

Floating Point Structure

IEEE 754 Floating Point Standard

Single Precision

A decimal number $x$ is broken down in single precision as:

$$ x = (-1)^\boldsymbol{S} \times (1.0 + \boldsymbol{F}) \times 2^{\boldsymbol{E} - 127} $$

  • $\boldsymbol{S}$ is the sign: $0$ for positive, $1$ for negative
  • $\boldsymbol{E}$ is the exponent: $1 \leq \boldsymbol{E} \leq 254$
  • $\boldsymbol{F}$ is the fraction: $0 \leq \boldsymbol{F} < 1$

The values of $\boldsymbol{S}$, $\boldsymbol{E}$, and $\boldsymbol{F}$ need to be converted to binaries of 1, 8, and 23 bits respectively, and concatenated to form a 32-bit binary number.

\[\begin{gather*} \boldsymbol{S} := \texttt{s} \\[1em] \boldsymbol{E} := \texttt{e}_7 \dots \texttt{e}_0 \\[1em] \boldsymbol{F} := \texttt{f}_1 \dots \texttt{f}_{23} \end{gather*}\]
Fractions

Remeber that $\boldsymbol{F}$ is a fraction, and its binary representation is obtained by finding $\texttt{f}_i \in \{0, 1\}$ such that:

\[\boldsymbol{F} = \sum_{i=1}^{23} \frac{\texttt{f}_i}{2^i}\]

where $\texttt{f}_1$ is the leftmost bit of $\boldsymbol{F}$.

Which is the reverse direction of $\boldsymbol{E}$:

\[\boldsymbol{E} = \sum_{i=0}^{7} \texttt{e}_i 2^i\]

Note how the indexings differ.

Single Precision Example

To find the single precision representation of $x = 3.125$:

$\texttt{s = 0}$ because $x$ is positive.

Now convert each side of the decimal point ($3$ and $0.125$) to binary:

\[\begin{gather*} (3.125)_{10} = (11.001)_2 \\[0.5em] \Downarrow \\[0.5em] 1.1001 \times 2^1 \\[0.5em] \Downarrow \\[0.5em] 1.1001 \times 2^{128 - 127} \end{gather*}\]

Hence $\boldsymbol{E} = 128 = \texttt{10000000}$ and $\boldsymbol{F} = \texttt{10010000000000000000000}$.

The single precision representation of $x$ is:

\[\texttt{01000000010010000000000000000000}\]