Floating Point Error Example

Last post, we talked about how to understand floating point error, but the only example we considered was the trivial case of \(a + b + c\). In this post, we will consider the example of summing the reciprocal squares of the natural numbers up to \(n\),

\[s_n = \sum_{k=1}^n \frac{1}{k^2},\]

in two different orders, and see how the error bounds we arrive at differ. Most importantly, we will see how much the achievable precision differs between the two.

We start by writing out the sum as

\[1 + \frac{1}{2^2} + \frac{1}{3^2} + \dots + \frac{1}{(n-1)^2} + \frac{1}{n^2}.\]

The two methods of summation we will consider are summing the terms from left to right, and summing them from right to left.
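To make the two orders concrete, here is a minimal sketch of both loops in Python (the function names are mine, purely for illustration):

```python
def sum_left_to_right(n):
    """Sum 1/k^2 for k = 1, 2, ..., n, accumulating the largest terms first."""
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / (k * k)
    return total


def sum_right_to_left(n):
    """Sum 1/k^2 for k = n, n - 1, ..., 1, accumulating the smallest terms first."""
    total = 0.0
    for k in range(n, 0, -1):
        total += 1.0 / (k * k)
    return total
```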

True Truncation Error

Finding the limit of this summation is known as the Basel problem and was first solved by Euler when he was a young mathematician. The solution to the infinite sum is exactly \(\frac{\pi^2}{6}\). However, we are not interested here in how to compute this. If you are interested in seeing some proofs, I recommend the stackexchange post here.

We are, however, interested in the analytical error created by stopping our summation early at \(n\) instead of \(\infty\). This error is exactly

\[\sum_{k=n + 1}^\infty \frac{1}{k^2},\]

which we can bound by the integral of the same monotonically decreasing function starting at \(n\), which gives us

\[\sum_{k=n + 1}^\infty \frac{1}{k^2} \leq \int_n^\infty \frac{1}{k^2} \, dk = \frac{1}{n}.\]

Now, we know the analytical error of cutting off the sum early. This will be useful later in calculating how close we can come to the true limit of \(\frac{\pi^2}{6}\) by increasing \(n\). After a certain point, the increase in floating point error will counteract the change in analytical error. That will be our limit of accuracy, and increasing \(n\) will not let our calculation get any more accurate.
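As a quick numerical sanity check of this bound (with \(n = 10\), an arbitrary choice), the true tail comes out to roughly 0.095, comfortably below \(1/n = 0.1\):

```python
import math

n = 10
partial = math.fsum(1.0 / (k * k) for k in range(1, n + 1))  # accurate partial sum
tail = math.pi ** 2 / 6 - partial                            # true truncation error at n
print(tail, "<=", 1.0 / n)                                   # roughly 0.095 <= 0.1
```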

Error Analysis

Note: These sections will be very mathy, and if you are just interested in the difference in result and possible accuracy achieved, you can skip to the end where I will list the results.

We will denote the \(k\)th term of the summation (in either direction) as \(a_k\), and the \(n\)th partial sum as \(s_n\). We will label each \(\delta\) that appears as either attached to a term \(a_k\) (for the rounding of the computed reciprocal) or attached to a partial sum \(s_n\) (for the rounding of the addition). Then, our computed partial sums \(\hat{s_n}\) have the recursive formula

\[\hat{s_n} = (\hat{s}_{n-1} + a_n(1 + \delta_{an}))(1 + \delta_{sn})\]

with the base case

\[\hat{s_2} = (a_1(1 + \delta_{a1}) + a_2(1 + \delta_{a2}))(1 + \delta_{s2}).\]

We are then going to solve for the absolute error, the term

\[| \hat{s_n} - s_n |.\]

We expand this recursion, ignoring any terms that multiply two \(\delta\)'s together (they are second order and negligible), and collect the \(\delta\)'s attached to each \(a_i\) into a single \(\delta_{ai}\) with a multiplicity. The computed sum \(\hat{s_n}\) is then approximately

\[a_1(1 + n\delta_{a1}) + a_2(1 + n\delta_{a2}) + a_3(1 + (n-1)\delta_{a3}) + \dots + a_n(1 + 2\delta_{an}).\]

Subtracting the true sum \(\sum_{k=1}^n a_k\) removes the leading \(a_k\)'s and leaves only the \(\delta\) terms. In order to make the result easier to work with, we will also add an additional \(a_1\delta_{a1}\) to the error, which gives us the nice summation

\[\sum_{k=1}^n a_k(n + 2 - k)\delta_{ak}.\]
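As a sanity check on the multiplicities, take \(n = 3\). Expanding the recursion and dropping products of \(\delta\)'s,

\[\hat{s_3} \approx a_1(1 + \delta_{a1} + \delta_{s2} + \delta_{s3}) + a_2(1 + \delta_{a2} + \delta_{s2} + \delta_{s3}) + a_3(1 + \delta_{a3} + \delta_{s3}),\]

so \(a_1\), \(a_2\), and \(a_3\) carry 3, 3, and 2 \(\delta\)'s respectively. After adding the extra \(a_1\delta_{a1}\), the multiplicities are 4, 3, and 2, matching \(n + 2 - k\) for \(k = 1, 2, 3\).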

Now, we can use this summation to bound the error in each of the two different cases.

Left to Right Analysis

In this case, we have terms

\[a_k = \frac{1}{k^2}\]

which substitutes into the absolute error summation as

\[\sum_{k=1}^n \frac{1}{k^2}(n + 2 - k) \delta_{ak}.\]

We already know that every \(\delta\) has, by definition, absolute value at most \(\frac{\epsilon}{2}\). Thus, we substitute that into the summation to get

\[| \hat{s_n} - s_n | \leq \sum_{k=1}^n \frac{\epsilon}{2} \frac{n + 2 - k}{k^2} \leq \frac{\epsilon(n + 2)}{2} \sum_{k=1}^n \frac{1}{k^2}.\]

Above, we also removed the \(-k\) term, as we are trying to upper bound the error, and we get a sufficient bound without it. We can then use the limit of the sequence \(s_n\) to state that the sum term is upper bounded by \(\frac{\pi^2}{6} \leq 2\). This gives us an error bound of

\[| \hat{s_n} - s_n | \leq (n + 2) \epsilon.\]

Right to Left Analysis

When summing from right to left, the \(k\)th term is

\[a_k = \frac{1}{(n + 1 - k)^2},\]

which gives us an absolute error summation of

\[\sum_{k=1}^n \frac{n + 2 - k}{(n + 1 - k)^2} \delta_{ak}\]

which we split into two sums by writing the numerator as \((n + 1 - k) + 1\). The second sum is the same sum as \(s_n\) (which we bounded by 2 above), just indexed in reverse, so using the \(\frac{\epsilon}{2}\) bound on each \(\delta\) it contributes at most \(2 \cdot \frac{\epsilon}{2} = \epsilon\):

\[\sum_{k=1}^n \frac{\delta_{ak}}{n + 1 - k} + \sum_{k=1}^n \frac{\delta_{ak}}{(n + 1 - k)^2} \leq \frac{\epsilon}{2} \sum_{j=1}^{n} \frac{1}{j} + \epsilon.\]

For the first sum, we again used the \(\frac{\epsilon}{2}\) bound on each \(\delta\) and reindexed with \(j = n + 1 - k\), which leaves the harmonic series.

We can then use a known bound on the harmonic series up to \(n\) terms,

\[\sum_{j=1}^n \frac{1}{j} \leq 1 + \ln n\]

which bounds the first sum's contribution by \(\frac{\epsilon}{2}(1 + \ln n)\).

Putting it all together, we get an error bound of

\[| \hat{s_n} - s_n| \leq \frac{\epsilon}{2}(3 + \ln n).\]

Comparison and Meaning

We now have an error bound for each summing method, listed below.

Method Error Bound
Left to Right \(\leq (n + 2) \epsilon\)
Right to Left \(\leq \frac{(3 + \ln n)\epsilon}{2}\)

Clearly, the order of growth of the error is hugely different. One is linear, and the other is logarithmic in the number of terms. It's crazy to see this difference in error bounds on such a simple summation, when the two methods are equally simple.
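The bounds above are worst cases, but the gap is easy to observe in practice as well. Here is a small experiment; it uses math.fsum as a correctly rounded reference for the sum of the already rounded terms, and \(n = 10^6\) is an arbitrary choice:

```python
import math

n = 10**6
terms = [1.0 / (k * k) for k in range(1, n + 1)]

reference = math.fsum(terms)            # correctly rounded sum of the rounded terms
left_to_right = sum(terms)              # built-in sum() accumulates left to right
right_to_left = sum(reversed(terms))    # same accumulation, smallest terms first

print(abs(left_to_right - reference))   # typically noticeably larger...
print(abs(right_to_left - reference))   # ...than this one
```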

Now that we see the difference in error bound, how does this affect how close this method can get to estimating the true limit of \(\pi^2 / 6\)? Clearly, neither can approximate it perfectly. However, how close does each get? And how many terms does it take?

The way to solve this is to write an expression for our total error, which in this case is the sum of the truncation error and the floating point error, and then minimize that expression.

\[\text{Total Error} = \text{Truncation Error} + \text{Floating Point Error}.\]

In both cases, the truncation error is bounded by \(\frac{1}{n}\), as we showed above.

Left to Right

The total error is

\[\frac{1}{n} + (n + 2)\epsilon.\]

We take the derivative with respect to \(n\), and set it equal to 0.

\[-\frac{1}{n^2} + \epsilon = 0 \implies n = \frac{1}{\sqrt{\epsilon}}\]

which will give us a maximum precision of

\[\sqrt{\epsilon} + \sqrt{\epsilon} + 2\epsilon = 2\sqrt{\epsilon} + 2\epsilon \approx 2^{27}\epsilon,\]

where the approximation takes \(\epsilon = 2^{-52}\) for double precision.

Right to Left

The total error is

\[\frac{1}{n} + \frac{3\epsilon}{2} + \frac{\ln n}{2}\epsilon.\]

We take the derivative, and set it equal to 0.

\[-\frac{1}{n^2} + \frac{\epsilon}{2n} = 0 \implies n = \frac{2}{\epsilon}\]

which gives us a maximum precision of

\[\frac{\epsilon}{2} + \frac{3\epsilon}{2} + \frac{\ln \frac{2}{\epsilon}}{2}\epsilon = 2\epsilon + \frac{\ln \frac{2}{\epsilon}}{2}\epsilon \approx 20\epsilon.\]
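As a rough cross-check of these two optima, we can evaluate the two total error bounds numerically over a grid of \(n\) and see where each one bottoms out. This is only a sketch of the bound curves (taking \(\epsilon = 2^{-52}\)), not a simulation of the actual sums:

```python
import math

EPS = 2.0 ** -52  # double precision machine epsilon, so each delta is at most EPS / 2

def total_error_left_to_right(n):
    return 1.0 / n + (n + 2) * EPS                # truncation + floating point bound

def total_error_right_to_left(n):
    return 1.0 / n + (3 + math.log(n)) * EPS / 2  # truncation + floating point bound

candidates = [2 ** p for p in range(10, 60)]      # coarse grid of n values
best_ltr = min(candidates, key=total_error_left_to_right)
best_rtl = min(candidates, key=total_error_right_to_left)

print(best_ltr, total_error_left_to_right(best_ltr) / EPS)  # about 2^26 terms, 2^27 eps
print(best_rtl, total_error_right_to_left(best_rtl) / EPS)  # about 2^53 terms, 20 eps
```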

The Difference

We see that in this case, the best accuracy achievable differs by a factor of several million, more than 6 orders of magnitude. This is an absolutely massive difference. Of course, to achieve this level of accuracy, we have to run the sum for similarly absurd numbers of iterations. For the left to right case, our accuracy stops improving after about 67 million terms. For the right to left case, we keep gaining accuracy all the way out to roughly \(2/\epsilon \approx 9\) quadrillion terms. We will never want to evaluate this sum that far out, but it is a useful example to showcase the difference in error that we can get just by changing the order in which we do a simple sum.
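If you want to watch the left to right sum hit its accuracy floor without running tens of millions of iterations, one trick is to repeat the experiment in single precision, where \(\epsilon \approx 1.2 \times 10^{-7}\) and the floor arrives after only a few thousand terms. Here is a sketch using NumPy's float32 purely as a stand-in for a lower precision machine (the \(10^6\) case takes a few seconds):

```python
import numpy as np

LIMIT = np.pi ** 2 / 6  # true value of the infinite sum

def errors_float32(n):
    """Return (left-to-right error, right-to-left error) for the first n terms in float32."""
    terms = (1.0 / np.arange(1, n + 1, dtype=np.float64) ** 2).astype(np.float32)
    forward = np.float32(0.0)
    for t in terms:          # left to right: largest terms first
        forward = forward + t
    backward = np.float32(0.0)
    for t in terms[::-1]:    # right to left: smallest terms first
        backward = backward + t
    return abs(float(forward) - LIMIT), abs(float(backward) - LIMIT)

for n in (10**3, 10**4, 10**5, 10**6):
    # The forward error stops improving after a few thousand terms,
    # while the backward error keeps shrinking.
    print(n, errors_float32(n))
```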

Overall, floating point error is generally very small, and usually negligible relative to the size of your problem or the tolerance of a solution. However, it's worth knowing how it works, what can help make it less pronounced, and general floating point best practices. Of course, you can always just use 128-bit floats if you are concerned it will be an issue, and then your (floating point) errors will almost disappear.