Floating Point Representation
Floating Point Structure
Single Precision
In single precision, a (normalized) number $x$ is decomposed as:
$$ x = (-1)^\boldsymbol{S} \times (1.0 + \boldsymbol{F}) \times 2^{\boldsymbol{E} - 127} $$
- $\boldsymbol{S}$ is the sign: $0$ for positive, $1$ for negative
- $\boldsymbol{E}$ is the exponent: $1 \leq \boldsymbol{E} \leq 254$ (the values $0$ and $255$ are reserved for special cases)
- $\boldsymbol{F}$ is the fraction: $0 \leq \boldsymbol{F} < 1$
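The decomposition above can be sketched directly in Python. This is a minimal illustration, not a full decoder: the function name `single_value` is hypothetical, and it only handles the normalized range of $\boldsymbol{E}$ stated above.

```python
def single_value(s: int, e: int, f: float) -> float:
    """Evaluate (-1)^S * (1 + F) * 2^(E - 127) for a normalized number.

    `single_value` is an illustrative name, not part of any library.
    """
    if s not in (0, 1):
        raise ValueError("S must be 0 or 1")
    if not 1 <= e <= 254:
        raise ValueError("E must satisfy 1 <= E <= 254")
    if not 0 <= f < 1:
        raise ValueError("F must satisfy 0 <= F < 1")
    return (-1) ** s * (1.0 + f) * 2.0 ** (e - 127)


print(single_value(0, 128, 0.5625))  # 3.125
print(single_value(1, 127, 0.0))     # -1.0
```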
The values of $\boldsymbol{S}$, $\boldsymbol{E}$, and $\boldsymbol{F}$ are encoded in binary using 1, 8, and 23 bits respectively, then concatenated in that order to form a 32-bit word.
\[\begin{gather*} \boldsymbol{S} := \texttt{s} \\[1em] \boldsymbol{E} := \texttt{e}_7 \dots \texttt{e}_0 \\[1em] \boldsymbol{F} := \texttt{f}_1 \dots \texttt{f}_{23} \end{gather*}\]
Fractions
Remember that $\boldsymbol{F}$ is a fraction, and its binary representation is obtained by finding $\texttt{f}_i \in \{0, 1\}$ such that:
\[\boldsymbol{F} = \sum_{i=1}^{23} \frac{\texttt{f}_i}{2^i}\]where $\texttt{f}_1$ is the leftmost bit of $\boldsymbol{F}$.
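This runs in the opposite direction to $\boldsymbol{E}$, which is an ordinary unsigned integer whose rightmost bit is $\texttt{e}_0$:
\[\boldsymbol{E} = \sum_{i=0}^{7} \texttt{e}_i 2^i\]Note how the indexing differs: $\texttt{f}_1$ is the most significant bit of $\boldsymbol{F}$, whereas $\texttt{e}_0$ is the least significant bit of $\boldsymbol{E}$.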
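The two sums above can be sketched in Python as follows. The helper names are hypothetical, and `fraction_bits` simply truncates after 23 bits, which is an assumption made for simplicity (real hardware normally rounds to the nearest representable value instead).

```python
def fraction_bits(f: float, width: int = 23) -> str:
    """Return the bits f_1 ... f_width of F, truncating any remainder."""
    if not 0 <= f < 1:
        raise ValueError("F must satisfy 0 <= F < 1")
    bits = []
    for _ in range(width):
        f *= 2
        bit = int(f)          # 1 if the doubled value reached 1, else 0
        bits.append(str(bit))
        f -= bit              # keep only the part still to be encoded
    return "".join(bits)


def fraction_value(bits: str) -> float:
    """Evaluate sum(f_i / 2**i), where f_1 is the leftmost bit."""
    return sum(int(b) / 2 ** i for i, b in enumerate(bits, start=1))


def exponent_bits(e: int) -> str:
    """Return the 8 bits e_7 ... e_0 of E."""
    return format(e, "08b")


def exponent_value(bits: str) -> int:
    """Evaluate sum(e_i * 2**i), where e_0 is the rightmost bit."""
    return sum(int(b) * 2 ** i for i, b in enumerate(reversed(bits)))


print(fraction_bits(0.5625))       # 10010000000000000000000
print(exponent_bits(128))          # 10000000
print(exponent_value("10000000"))  # 128
```

Note that `fraction_value` enumerates the bits from the left starting at $i = 1$, while `exponent_value` reverses the string so that index $i = 0$ lands on the rightmost bit.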
Single Precision Example
To find the single precision representation of $x = 3.125$:
$\texttt{s = 0}$ because $x$ is positive.
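Next, convert the integer and fractional parts to binary separately: $3 = (11)_2$ and $0.125 = 2^{-3} = (0.001)_2$. Joining them and normalizing so that a single $1$ precedes the binary point gives:
\[\begin{gather*} (3.125)_{10} = (11.001)_2 \\[0.5em] \Downarrow \\[0.5em] (1.1001)_2 \times 2^1 \\[0.5em] \Downarrow \\[0.5em] (1.1001)_2 \times 2^{128 - 127} \end{gather*}\]Hence $\boldsymbol{E} = 128 = (\texttt{10000000})_2$ and, padding the fraction bits $\texttt{1001}$ with zeros to 23 bits, $\boldsymbol{F} = \texttt{10010000000000000000000}$.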
The single precision representation of $x$ is:
\[\texttt{01000000010010000000000000000000}\]
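One way to check this result is with Python's standard `struct` module, which can pack a value as an IEEE 754 single. The sketch below uses a hypothetical helper `float_to_bits` and splits the 32 bits back into the three fields.

```python
import struct


def float_to_bits(x: float) -> str:
    """Pack x as a big-endian IEEE 754 single and return its 32 bits as text."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return format(word, "032b")


bits = float_to_bits(3.125)
s, e, f = bits[0], bits[1:9], bits[9:]  # 1 sign bit, 8 exponent bits, 23 fraction bits

print(bits)     # 01000000010010000000000000000000
print(s, e, f)  # 0 10000000 10010000000000000000000
```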