Floating-Point Arithmetic

How real numbers are squeezed into 64 bits, and what surprises come with the squeeze

Almost every number you compute in this course will be stored as a 64-bit IEEE 754 double-precision float. The next four pages build the numerical foundation of the course on top of that choice. This first page covers the format itself, the operations defined on it, and the most common ways those operations bite working scientists.

The format

A 64-bit double is a triplet of bit fields:

Field	Bits	Meaning
Sign $s$	1	$0$ for positive, $1$ for negative
Exponent $e$	11	Biased integer, $0 \le e \le 2047$
Mantissa (a.k.a. fraction) $m$	52	Binary fraction, $0 \le m < 1$

The value represented is

$(-1)^s \cdot (1 + m) \cdot 2^{e - 1023}$

for ordinary "normal" numbers. There are also subnormals (extremely small numbers near zero), infinities (±inf), and NaNs (not-a-number values).

You can pry the bits open and look:

A few things to notice:

1.0 has exponent $1023$ (the bias) and mantissa $0$ — it is literally $1 \cdot 2^0$ .
2.0 differs from 1.0 in one bit: the exponent.
0.1 has a non-zero mantissa, because $0.1$ cannot be written exactly in binary — it is a repeating binary fraction.

Machine epsilon

The smallest float strictly greater than $1.0$ is $1 + \epsilon$ where $\epsilon = 2^{-52} \approx 2.22 \times 10^{-16}$ . This is called the machine epsilon and it is the most important constant in numerical analysis.

When you read a number off a print(x), you can usually trust the first 15 to 16 decimal digits. Beyond that, the value is noise.

Special values

Three special "numbers" appear all the time:

+inf / -inf — produced by 1.0 / 0.0 or by overflow
nan — not a number, produced by indeterminate forms like 0.0 / 0.0 or inf - inf
-0.0 — yes, negative zero, distinguishable from +0.0 in a few rare cases

NaN is special because it is not equal to anything, including itself. This is a deliberate design choice: it lets you check whether a computation went wrong with x != x, and it makes any chained calculation that touches a NaN end in NaN (rather than silently producing a finite garbage result).

Always check for NaN at the boundary

A common source of mysterious bugs is a NaN that sneaks into your data from a missing value, a log(0), or a sqrt(-1). Use np.isnan(x).any() after loading data or after any operation that could produce one, and decide explicitly what to do.

How rounding works

When an exact arithmetic result is not representable as a float, IEEE 754 specifies round-to-nearest, ties-to-even (also called banker's rounding). Halfway cases round to the float whose last mantissa bit is zero. This eliminates the systematic upward bias that plain "round half up" would introduce.

Comparing floats

You should almost never compare floats with ==. Use a tolerance instead. NumPy ships a helper that lets you specify both absolute and relative tolerance:

The rule of thumb: rtol should be a few times machine epsilon if you expect a "good" computation, and looser if your inputs come from measurements with their own uncertainty.

Subtraction is where precision dies

We have already met catastrophic cancellation in the story chapters. It is worth restating it as a numerical principle:

If two floats $a$ and $b$ agree in their first $k$ leading bits, then $a - b$ has only $\approx 53 - k$ bits of precision left.

If $k$ is large (the numbers are close), the subtraction throws away most of the information.

The classic example: a midpoint approximation of the derivative.

Truncation error shrinks like $h^2$ at first — you can see the error drop by a factor of $100$ for each factor-of-10 reduction in $h$ . But around $h = 10^{-6}$ , round-off takes over and the error starts to grow. The optimum is somewhere near $h = \sqrt[3]{\epsilon}$ .

This is one of the most important pictures in numerical analysis, and we will return to it in Approximation and Error.

Overflow, underflow, and dynamic range

A float can hold values from roughly $2 \times 10^{-308}$ to $1.8 \times 10^{308}$ before underflow/overflow. Inside that range, arithmetic is reliable. Outside, things go wrong silently.

The classic trap: computing $\log(\sum_i e^{x_i})$ — the log-sum-exp that appears all over physics and machine learning. If any $x_i$ is around $1000$ , then $e^{x_i}$ overflows to inf. The trick is to subtract the maximum first.

The naive answer is inf. The stable answer is around $1001.41$ . Same equation, very different floating-point behavior.

A short practice problem

The textbook formula for variance,

$$\mathrm{Var}(x) = \frac{1}{n} \sum_i x_i^2 ; - ; \left( \frac{1}{n} \sum_i x_i \right)^2$$

is numerically disastrous for data with large mean and small spread, because the two terms become nearly equal and cancel.

Implement stable_variance(xs) that computes the population variance using the two-pass algorithm

$$\bar{x} = \frac{1}{n}\sum_i x_i, \qquad \mathrm{Var}(x) = \frac{1}{n} \sum_i (x_i - \bar{x})^2$$

The naive implementation is provided as naive_variance so you can see what it returns.

Check your understanding

QuestionSelect one

What does the bit pattern $s=0$ , $e=1023$ , $m=0$ represent in IEEE 754 double precision?

$0.0$

$1.0$

$1023.0$

$-0.0$

QuestionSelect one

Why do experienced numerical programmers usually avoid a == b for floating-point comparison?

The == operator is slow

Arithmetic that should produce equal values often differs by a few ULPs (units in the last place), so two values that are "morally" equal compare false

== raises an exception on nan

Python compares floats by string representation

Floating-Point Arithmetic

On this page