Understanding Floating Point Error


Floating point is the most commonly used number format for numerical computation. However, most people don't know how error is created and propagated in floating point calculations. In this post, I go over the basics of floating point rounding and error bounds.

Floating Point

There are many explanations of floating point on the internet, so I will not write a detailed one here. If you do not know what floating point is, I recommend reading an introductory article first. For our purposes, we are interested in the IEEE 754 64-bit floating point representation. I mostly ignore the sign bit, as it is not relevant to error calculations. Below, I list the breakdown of bits for this representation.

| Mantissa Length | Characteristic Length |
| --- | --- |
| 52 | 11 |

We denote the mantissa by \(m\) of length \(l_m\), the characteristic by \(c\) of length \(l_c\), and the sign bit by \(s\). The characteristic is biased, which means that the true exponent is given by \(c - 2^{l_c - 1} + 1\), giving us \(2^{l_c - 1}\) positive exponents and \(2^{l_c - 1} - 1\) negative exponents. Any number with \(c \neq 0\) and \(c \neq 2^{l_c} - 1\) (a normal number) is represented as

\[(-1)^s \left(1 + \frac{m}{2^{52}}\right) \cdot 2^{c - 1023}\]

For example, if we had the number 5.25, we can convert this to floating point. We first see that this number is between \(4=2^2\) and \(8 = 2^3\), and so \(c - 1023 = 2 \implies c = 1025\). Then, we see that

\[5.25 = 4 + 1 + .25 = 2^2 + 2^0 + 2^{-2}\]

which gives us the mantissa: factoring out \(2^2\), we get \(5.25 = \left(1 + 2^{-2} + 2^{-4}\right) \cdot 2^2\), so the mantissa bits are \(0101\) followed by zeros.

| Sign | Mantissa | Characteristic |
| --- | --- | --- |
| 0 | \(\underbrace{0101000\dots0}_\text{52 digits}\) | 10000000001 |
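
To double-check this by machine, here is a small Python sketch that unpacks the raw bits of 5.25 into the three fields above:

```python
import struct

# Interpret the 8 bytes of the double 5.25 as a 64-bit integer.
bits = int.from_bytes(struct.pack(">d", 5.25), "big")

sign = bits >> 63                        # 1 bit
characteristic = (bits >> 52) & 0x7FF    # 11 bits
mantissa = bits & ((1 << 52) - 1)        # 52 bits

print(sign)                              # 0
print(f"{characteristic:011b}")          # 10000000001 (i.e., 1025)
print(f"{mantissa:052b}")                # 0101 followed by 48 zeros
```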

Machine Epsilon

We all know that with only 64 bits to represent a number, we can only represent finitely many numbers: at most \(2^{64}\) of them. How far apart are these numbers under the floating point standard?

The answer depends on the characteristic of the floating point number. For every exponent \(-1022 \leq k \leq 1023\), there are exactly \(2^{52}\) numbers between \(2^k\) and \(2^{k+1}\), one for each possible mantissa. Note that we are ignoring the smallest and largest characteristics here, as those correspond to special floating point numbers (subnormals, infinities, and NaNs).

First, we will consider the numbers \(1 \leq x < 2\). The smallest number we can represent in this range is exactly 1, by letting the characteristic equal 1023 and the mantissa be 0. The next largest number we can represent in this range is \(1 + 2^{-52}\), by letting the characteristic equal 1023 and the mantissa be all 0s except for a 1 at the end. Thus, the difference between these numbers is \(\epsilon = 2^{-52}\). Successive numbers in this range differ by 1 in the mantissa, which is \(\epsilon\) on the real number line. We call \(\epsilon\) machine epsilon: the gap between 1 and the next number the computer can represent in this format. However, we have not yet determined whether \(\epsilon\) is the smallest gap everywhere, or whether the gap is relative to the number represented.

Then, we consider the numbers between 2 and 4. Successive numbers still differ by 1 in the mantissa. However, this now corresponds to \(2^{-51}\), or \(2\epsilon\), on the real number line. Between 4 and 8, the distance is \(4\epsilon\). Between 1/2 and 1, the distance is only \(\epsilon / 2\). By now, the trend should be fairly clear: the distance between successive numbers is relative to the size of the number. More precisely, the distance from a positive number \(x\) to the next floating point number is exactly

\[2^{\lfloor \log_2(x) \rfloor} \epsilon.\]
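
We can sanity-check this formula with Python's math.ulp (available in Python 3.9+), which returns the gap from a float to the next representable one:

```python
import math

eps = 2.0 ** -52

# math.ulp(x) gives the gap from x to the next representable float.
for x in [0.75, 1.0, 1.5, 2.0, 5.25]:
    predicted = 2.0 ** math.floor(math.log2(x)) * eps
    print(x, math.ulp(x), math.ulp(x) == predicted)   # every comparison prints True
```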

Rounding in Floating Point

We need to introduce some notation to differentiate between a number, \(x\), and its floating point representation, \(\textbf{fl}(x)\). The floating point representation of \(x\) is just \(x\) rounded to the nearest representable number, and we round just as we would with an ordinary decimal number.1 The key to floating point error is understanding the difference between \(x\) and \(\textbf{fl}(x)\), and how this difference propagates through floating point operations.

For example, if we have \(x = 1 + \frac{\epsilon}{4}\), we round this to \(\textbf{fl}(x) = 1\). On the other hand, \(\textbf{fl}\left(1 + \frac{3\epsilon}{4}\right) = 1 + \epsilon\).
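
Both of these roundings are easy to verify in Python; note that \(\epsilon/4\) and \(3\epsilon/4\) are themselves exactly representable, so the only rounding happens in the additions:

```python
eps = 2.0 ** -52

# 1 + eps/4 is less than halfway to the next float, so it rounds back down to 1.
print(1.0 + eps / 4 == 1.0)             # True

# 1 + 3*eps/4 is more than halfway, so it rounds up to 1 + eps.
print(1.0 + 3 * eps / 4 == 1.0 + eps)   # True
```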

Rounding Error in Floating Point

We know that rounding will introduce some error: it maps a whole range of real numbers to a single representable number. How do we quantify this?

Let’s consider rounding \(x\) between 1 and 2. Then, the distance between consecutive floating point numbers is \(\epsilon\). At worst, a point \(x\) will be \(\frac{\epsilon}{2}\) from \(\textbf{fl}(x)\). Mathematically, this becomes

\[|x - \textbf{fl}(x)| \leq \frac{\epsilon}{2}\]

In fact, since \(\mid x \mid \geq 1\) here, dividing the left side by \(\mid x \mid\) only makes it smaller, so the inequality still holds. The resulting expression is true for all \(x\), not just those between 1 and 2, because the gap between floating point numbers scales with \(\mid x \mid\).2

\[\frac{|x - \textbf{fl}(x)|}{|x|} \leq \frac{\epsilon}{2}\]

When we want to talk about the error and propagate it, we want something better than carrying an awkward inequality through every single arithmetic operation. Thus, we write \(\textbf{fl}(x) - x = \delta x\) for some \(-\frac{\epsilon}{2} \leq \delta \leq \frac{\epsilon}{2}\). This can also be written as \(\textbf{fl}(x) = x(1 + \delta)\). Now we have an exact expression for \(\textbf{fl}(x)\) that makes understanding and propagating these errors through complex floating point expressions and algorithms much more practical.
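
We can check this model numerically with exact rational arithmetic. In the sketch below, the choice \(x = 1/3\) is arbitrary; any value that is not exactly representable works:

```python
from fractions import Fraction

eps = Fraction(1, 2 ** 52)

x = Fraction(1, 3)               # exact 1/3, which is not representable in binary
fl_x = Fraction(float(x))        # float(x) is x correctly rounded to a double

delta = (fl_x - x) / x           # fl(x) = x * (1 + delta)
print(abs(delta) <= eps / 2)     # True
```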

Now that we understand how to quantify floating point error when rounding, we need to understand how to apply it to floating point operations.

Error in Floating Point Operations

Most computers nowadays have dedicated circuits for performing basic floating point operations. However, to understand the error of a floating point operation, we don't need to understand how these circuits work. We treat the floating point operation as a black box and assume that its result is the exact result, correctly rounded.

To understand what this means, let's assume that our number representation can only handle integers, and that we've implemented division as one of our operations. Clearly, the result of \(4 / 3\) is not going to be an integer, as it is \(1.33\dots\). Thus, we assume that our division operation finds the correct result (\(1.33\dots\)) and then rounds it, returning 1. We assume this even though the system cannot represent \(1.33\dots\) itself.
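
Real double precision arithmetic behaves the same way: the result of a single operation is the exact result, correctly rounded. Here is a quick check of one addition using Python's exact Fraction arithmetic:

```python
from fractions import Fraction

a, b = 0.1, 0.2

exact_sum = Fraction(a) + Fraction(b)   # exact sum of the two stored doubles
print(a + b == float(exact_sum))        # True: the hardware result is the exact sum, rounded
```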

Multiple Inputs

Every basic arithmetic operation is considered to take a single floating point operation. However, it is important to note that floating point operations only take two inputs at a time, and they are not associative. Thus, the mathematical expression \(a + b + c\) does not translate to a unique sequence of floating point operations. It could be either \(\textbf{fl}(\textbf{fl}(a + b) + c)\) or \(\textbf{fl}(a + \textbf{fl}(b + c))\).
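
Non-associativity is easy to see even with everyday decimal values (a quick Python check):

```python
a, b, c = 0.1, 0.2, 0.3

print((a + b) + c)                    # 0.6000000000000001
print(a + (b + c))                    # 0.6
print((a + b) + c == a + (b + c))     # False
```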

When we look at the actual results of the two orderings above, they are not the same. Also, when we write out the error for multiple operations, we need to keep track of multiple \(\delta\)s, one for each rounding, so we number them as we use them.

| Representation | Result after the first operation | Final result |
| --- | --- | --- |
| \(\textbf{fl}(\textbf{fl}(a + b) + c)\) | \(\textbf{fl}((a + b)(1 + \delta_1) + c)\) | \(((a + b)(1 + \delta_1) + c)(1 + \delta_2)\) |
| \(\textbf{fl}(a + \textbf{fl}(b + c))\) | \(\textbf{fl}(a + (b + c)(1 + \delta_3))\) | \((a + (b + c)(1 + \delta_3))(1 + \delta_4)\) |

When we expand the expressions above and collect the \(a\), \(b\), and \(c\) terms, we get

\[a(1 + \delta_1)(1 + \delta_2) + b(1 + \delta_1)(1 + \delta_2) + c(1 + \delta_2)\]

and

\[a(1 + \delta_4) + b(1 + \delta_3)(1 + \delta_4) + c(1 + \delta_3)(1 + \delta_4).\]

First, we simplify by rewriting \(\delta_i + \delta_j\) as \(2\delta_k\), where \(\delta_k\) is the average of the two (and still at most \(\frac{\epsilon}{2}\) in absolute value). Then, we note that terms containing a product of two \(\delta\)s have absolute value at most \(\epsilon^2 / 4\), which is small enough to ignore. Thus, we get

\[a(1 + 2\delta_5) + b(1 + 2\delta_5) + c(1 + \delta_2)\]

and

\[a(1 + \delta_4) + b(1 + 2\delta_6) + c(1 + 2\delta_6).\]

We can see that the possible error on each input scales linearly with the number of operations it participates in. We can also see that the possible error is different for each input. For example, if \(a = 2^{53}\) and \(b = c = 1\), the absolute error bounds would be

| Expression | Value with error terms | Absolute error bound |
| --- | --- | --- |
| \(\textbf{fl}(\textbf{fl}(a + b) + c)\) | \(a(1 + 2\delta_5) + b(1 + 2\delta_5) + c(1 + \delta_2)\) | \(2 + \frac{3\epsilon}{2}\) |
| \(\textbf{fl}(a + \textbf{fl}(b + c))\) | \(a(1 + \delta_4) + b(1 + 2\delta_6) + c(1 + 2\delta_6)\) | \(1 + 2\epsilon\) |

which are very different. The error bound for the same sum nearly doubles from a simple change in the order of operations! Of course, this was an extreme example to demonstrate a point; we will show something more realistic in a later post. In general, we want the intermediate sums to be as small as possible. An easy heuristic for this is to add the smallest numbers first and the largest numbers last.
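
To see this on an actual machine, here is a quick Python check with these inputs, comparing both orderings against the exact integer sum:

```python
a, b, c = 2.0 ** 53, 1.0, 1.0
exact = 2 ** 53 + 2                  # exact sum, computed with integers

first = (a + b) + c                  # add the large number first
second = a + (b + c)                 # add the small numbers first

print(abs(first - exact))            # 2.0 -- within the bound 2 + 3*eps/2
print(abs(second - exact))           # 0.0 -- within the bound 1 + 2*eps
```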

For the rest of the post, we assume that we did the calculation in the first way, \((a + b) + c\). Next, we have to decide how we want to interpret this error.

Forward vs Backward Error Analysis

There are two distinct ways to calculate the error of an expression \(\textbf{fl}(f(a, b, c))\):

  1. Forward Error Analysis: We consider the relative error between the exact result and the floating point result. This gives us an error expression of

    \[\frac{\mid \textbf{fl}(a + b + c) - (a + b + c) \mid }{ \mid a + b + c \mid}\]
  2. Backward Error Analysis: We consider the result as the exact, correct operation applied to almost the correct inputs. In this case, we get \(\textbf{fl}(a + b + c) = \hat{a} + \hat{b} + \hat{c}\), where \(\hat{a} = a(1 + 2\delta_5)\) and so on. Then, we bound how close these perturbed inputs are to the actual inputs. Here, we would say that

    \[\left| \frac{\hat{a} - a}{a} \right| \leq \epsilon.\]

For this example, it turns out that forward error analysis gives extremely weak bounds: we can construct inputs that make the relative forward error bound as bad as we want. Dividing the absolute error bound \(\left(2|a + b| + |c|\right)\frac{\epsilon}{2}\) by \(|a + b + c|\), the relative error bound simplifies to

\[\frac{2 |a + b| + |c|}{| a + b + c|} \frac{\epsilon}{2}\]

which grows very large if we choose some very small positive number \(x\), and then let \(a + b = 1\) and \(c = - 1 + x\). Then, our error bound becomes

\[\frac{3 - x}{x} \frac{\epsilon}{2}\]

which goes to infinity as we let \(x \to 0\). Thus, this relative error bound is not useful.
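
The weakness is not only in the bound: with inputs that actually cancel, the measured relative forward error is also enormous. Here is a sketch with my own choice of inputs, \(a = 1/3\), \(b = 2/3\), and \(c = -1 + 10^{-15}\), compared against the exact sum of the stored inputs:

```python
from fractions import Fraction

a, b, c = 1 / 3, 2 / 3, -1.0 + 1e-15

computed = (a + b) + c
exact = Fraction(a) + Fraction(b) + Fraction(c)   # exact sum of the stored doubles

rel_err = abs(Fraction(computed) - exact) / abs(exact)
print(float(rel_err))    # about 0.059 -- roughly 5e14 times eps/2
```

The backward error for this same computation, by contrast, is still bounded by \(\epsilon\), as shown above.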

Therefore, it is better to use backward error analysis for this example. In fact, the same is true in general when analyzing floating point error.

  1. The standard for breaking ties in floating point rounding is to round so that the last bit of the result is 0. Between any two consecutive representable numbers, exactly one has a last bit of 0, so this rule always picks a unique result. This differs from the usual decimal convention of breaking ties by always rounding up (or always rounding down). Always rounding ties in one direction would introduce a bias into our floating point calculations: the rounded result of a tie would always be greater than or equal to the true result if we rounded ties up, and less than or equal to it if we rounded them down. Rounding ties so that the last bit is 0 is symmetric, which makes the rounding unbiased over random inputs. 

  2. This is slightly imprecise, but you can convince yourself of this by considering the possible errors right around the ‘breakpoints’ of \(x=1\) and \(x=2\). Then, you can see how this extends to the general breakpoints present at powers of 2.