Floating-Point Arithmetic
How real numbers are squeezed into 64 bits, and what surprises come with the squeeze
Almost every number you compute in this course will be stored as a 64-bit IEEE 754 double-precision float. The next four pages build the numerical foundation of the course on top of that choice. This first page covers the format itself, the operations defined on it, and the most common ways those operations bite working scientists.
The format
A 64-bit double is a triplet of bit fields:
| Field | Bits | Meaning |
|---|---|---|
| Sign | 1 | for positive, for negative |
| Exponent | 11 | Biased integer, |
| Mantissa (a.k.a. fraction) | 52 | Binary fraction, |
The value represented is
for ordinary "normal" numbers. There are also subnormals
(extremely small numbers near zero), infinities (±inf), and
NaNs (not-a-number values).
You can pry the bits open and look:
A few things to notice:
1.0has exponent (the bias) and mantissa — it is literally .2.0differs from1.0in one bit: the exponent.0.1has a non-zero mantissa, because cannot be written exactly in binary — it is a repeating binary fraction.
Machine epsilon
The smallest float strictly greater than is where . This is called the machine epsilon and it is the most important constant in numerical analysis.
When you read a number off a print(x), you can usually trust the
first 15 to 16 decimal digits. Beyond that, the value is noise.
Special values
Three special "numbers" appear all the time:
+inf/-inf— produced by1.0 / 0.0or by overflownan— not a number, produced by indeterminate forms like0.0 / 0.0orinf - inf-0.0— yes, negative zero, distinguishable from+0.0in a few rare cases
NaN is special because it is not equal to anything, including
itself. This is a deliberate design choice: it lets you check
whether a computation went wrong with x != x, and it makes any
chained calculation that touches a NaN end in NaN (rather than
silently producing a finite garbage result).
Always check for NaN at the boundary
A common source of mysterious bugs is a NaN that sneaks into your
data from a missing value, a log(0), or a sqrt(-1). Use
np.isnan(x).any() after loading data or after any operation that
could produce one, and decide explicitly what to do.
How rounding works
When an exact arithmetic result is not representable as a float, IEEE 754 specifies round-to-nearest, ties-to-even (also called banker's rounding). Halfway cases round to the float whose last mantissa bit is zero. This eliminates the systematic upward bias that plain "round half up" would introduce.
Comparing floats
You should almost never compare floats with ==. Use a tolerance
instead. NumPy ships a helper that lets you specify both absolute
and relative tolerance:
The rule of thumb: rtol should be a few times machine epsilon if
you expect a "good" computation, and looser if your inputs come from
measurements with their own uncertainty.
Subtraction is where precision dies
We have already met catastrophic cancellation in the story chapters. It is worth restating it as a numerical principle:
If two floats and agree in their first leading bits, then has only bits of precision left.
If is large (the numbers are close), the subtraction throws away most of the information.
The classic example: a midpoint approximation of the derivative.
Truncation error shrinks like at first — you can see the error drop by a factor of for each factor-of-10 reduction in . But around , round-off takes over and the error starts to grow. The optimum is somewhere near .
This is one of the most important pictures in numerical analysis, and we will return to it in Approximation and Error.
Overflow, underflow, and dynamic range
A float can hold values from roughly to before underflow/overflow. Inside that range, arithmetic is reliable. Outside, things go wrong silently.
The classic trap: computing — the log-sum-exp
that appears all over physics and machine learning. If any is
around , then overflows to inf. The trick is to
subtract the maximum first.
The naive answer is inf. The stable answer is around .
Same equation, very different floating-point behavior.
A short practice problem
The textbook formula for variance,
$$\mathrm{Var}(x) = \frac{1}{n} \sum_i x_i^2 ; - ; \left( \frac{1}{n} \sum_i x_i \right)^2$$
is numerically disastrous for data with large mean and small spread, because the two terms become nearly equal and cancel.
Implement stable_variance(xs) that computes the population
variance using the two-pass algorithm
$$\bar{x} = \frac{1}{n}\sum_i x_i, \qquad \mathrm{Var}(x) = \frac{1}{n} \sum_i (x_i - \bar{x})^2$$
The naive implementation is provided as naive_variance so you
can see what it returns.
Check your understanding
What does the bit pattern , , represent in IEEE 754 double precision?
Why do experienced numerical programmers usually avoid a == b for floating-point comparison?
The == operator is slow
Arithmetic that should produce equal values often differs by a few ULPs (units in the last place), so two values that are "morally" equal compare false
== raises an exception on nan
Python compares floats by string representation