Floating-Point Arithmetic

How real numbers are squeezed into 64 bits, and what surprises come with the squeeze

Almost every number you compute in this course will be stored as a 64-bit IEEE 754 double-precision float. The next four pages build the numerical foundation of the course on top of that choice. This first page covers the format itself, the operations defined on it, and the most common ways those operations bite working scientists.

The format

A 64-bit double is a triplet of bit fields:

| Field | Bits | Meaning |
| --- | --- | --- |
| Sign $s$ | 1 | 0 for positive, 1 for negative |
| Exponent $e$ | 11 | Biased integer, $0 \le e \le 2047$ |
| Mantissa (a.k.a. fraction) $m$ | 52 | Binary fraction, $0 \le m < 1$ |

The value represented is

$$(-1)^s \cdot (1 + m) \cdot 2^{e - 1023}$$

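for ordinary "normal" numbers, those with $1 \le e \le 2046$. The two extreme exponent values are reserved: $e = 0$ encodes zero and the subnormals (extremely small numbers near zero), and $e = 2047$ encodes the infinities (±inf) and NaNs (not-a-number values).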

You can pry the bits open and look:

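A minimal sketch of one way to do this, using only the standard library's struct module. The helper name float_fields is made up for this illustration, and it returns the mantissa as the raw 52-bit integer, so the fraction $m$ in the formula above is that integer divided by $2^{52}$.

```python
import struct


def float_fields(x: float) -> tuple[int, int, int]:
    """Split a Python float into its (sign, exponent, mantissa) bit fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # top bit
    exponent = (bits >> 52) & 0x7FF      # next 11 bits, still biased
    mantissa = bits & ((1 << 52) - 1)    # low 52 bits, as a raw integer
    return sign, exponent, mantissa


for x in (1.0, 2.0, 0.1):
    s, e, m = float_fields(x)
    print(f"{x!r:>4}: sign={s} exponent={e:4d} mantissa={m:052b}")
```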

A few things to notice:

  • 1.0 has exponent $1023$ (the bias) and mantissa $0$: it is literally $1 \cdot 2^0$.
  • 2.0 differs from 1.0 only in the exponent field, which is $1024$ instead of $1023$; the sign and mantissa are identical.
  • 0.1 has a non-zero mantissa, because $0.1$ cannot be written exactly in binary; it is a repeating binary fraction.

Machine epsilon

The smallest float strictly greater than $1.0$ is $1 + \epsilon$, where $\epsilon = 2^{-52} \approx 2.22 \times 10^{-16}$. This is called the machine epsilon, and it is the most important constant in numerical analysis.

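A quick way to check these claims, sketched with the standard library only (math.nextafter needs Python 3.9 or newer):

```python
import math
import sys

eps = sys.float_info.epsilon        # machine epsilon for Python floats (doubles)

print(eps)                          # 2.220446049250313e-16
print(eps == 2.0 ** -52)            # True
# The next float after 1.0, moving toward 2.0, is exactly 1 + eps.
print(math.nextafter(1.0, 2.0) == 1.0 + eps)   # True
# Half an epsilon is too small to register: the sum rounds back to 1.0.
print(1.0 + eps / 2 == 1.0)         # True
```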

When you read a number off a print(x), you can usually trust at most the first 15 to 16 significant decimal digits, and fewer once rounding errors have accumulated over a computation. Beyond that, the digits are noise.

Special values

Three special "numbers" appear all the time:

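  • +inf / -inf: produced by a division such as 1.0 / 0.0 under IEEE 754 rules, or by overflow. NumPy follows the IEEE rule and returns inf with a warning; plain Python floats raise ZeroDivisionError instead.
  • nan: not a number, produced by indeterminate forms like 0.0 / 0.0 or inf - inf
  • -0.0: yes, negative zero, distinguishable from +0.0 in a few rare cases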

NaN is special because it is not equal to anything, including itself. This is a deliberate design choice: it lets you check whether a computation went wrong with x != x, and it makes any chained calculation that touches a NaN end in NaN (rather than silently producing a finite garbage result).

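A short sketch of these behaviors in plain Python. One caveat: plain Python raises ZeroDivisionError for 1.0 / 0.0 rather than returning inf, so the special values are built with float() here.

```python
import math

inf = float("inf")
nan = float("nan")

print(1e308 * 10)                 # inf (overflow)
print(inf - inf)                  # nan (indeterminate form)
print(nan == nan)                 # False: NaN is not equal to anything
print(nan != nan)                 # True: the self-inequality test
print(math.isnan(nan * 0.0 + 1))  # True: NaN propagates through arithmetic
print(0.0 == -0.0)                # True: the two zeros compare equal...
print(math.copysign(1.0, -0.0))   # -1.0 ...but the sign bit is still there
```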

Always check for NaN at the boundary

A common source of mysterious bugs is a NaN that sneaks into your data from a missing value, a log(0), or a sqrt(-1). Use np.isnan(x).any() after loading data or after any operation that could produce one, and decide explicitly what to do.
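
As a sketch of what such a boundary check might look like, assuming a small made-up array in place of real loaded data. Dropping the bad entries is only one possible policy, chosen here for brevity.

```python
import numpy as np

# Made-up measurements with one missing value encoded as NaN.
data = np.array([1.5, np.nan, 3.0, 4.5])

mask = np.isnan(data)
n_bad = int(mask.sum())
if mask.any():
    print(f"found {n_bad} NaN value(s)")
    # One possible policy: drop the bad entries. Raising an error or
    # imputing a value are equally valid choices, depending on the analysis.
    data = data[~mask]

print(data.mean())   # 3.0
```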

How rounding works

When an exact arithmetic result is not representable as a float, IEEE 754's default rounding mode is round-to-nearest, ties-to-even (the same tie rule as banker's rounding). The result is the nearest representable float, and exact halfway cases round to the float whose last mantissa bit is zero. This eliminates the systematic upward bias that plain "round half up" would introduce.

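The tie rule can be observed directly. A small sketch using exact halfway cases, where the outcome is fully determined by the rule:

```python
eps = 2.0 ** -52        # spacing between floats just above 1.0
half_ulp = 2.0 ** -53   # exactly half of that spacing

# 1 + half_ulp lies halfway between 1.0 (last mantissa bit 0) and
# 1 + eps (last mantissa bit 1): the tie goes to 1.0.
print(1.0 + half_ulp == 1.0)                      # True

# (1 + eps) + half_ulp lies halfway between 1 + eps (last bit 1) and
# 1 + 2*eps (last bit 0): this time the tie rounds up.
print((1.0 + eps) + half_ulp == 1.0 + 2 * eps)    # True

# The same rule applies when converting large integers: above 2**53
# only even integers are representable.
print(float(2**53 + 1) == 2.0**53)                # True (tie rounds down)
print(float(2**53 + 3) == 2.0**53 + 4.0)          # True (tie rounds up)
```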

Comparing floats

You should almost never compare floats with ==. Use a tolerance instead. NumPy ships a helper that lets you specify both absolute and relative tolerance:

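A minimal sketch, assuming the helper in question is np.isclose (np.allclose is its array-wide counterpart). The test np.isclose(a, b) passes when |a - b| <= atol + rtol * |b|, with defaults rtol=1e-5 and atol=1e-8.

```python
import numpy as np

a = 0.1 + 0.2
b = 0.3

print(a == b)              # False: the two values differ in the last bit
print(np.isclose(a, b))    # True with the default tolerances

# A much tighter, purely relative test: a few machine epsilons.
eps = np.finfo(float).eps
print(np.isclose(a, b, rtol=4 * eps, atol=0.0))   # True

# Values that really differ are still told apart.
print(np.isclose(1.0, 1.001))                     # False
```

Setting atol=0.0 makes the comparison purely relative, which is usually what you want unless the values being compared can be close to zero.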

The rule of thumb: rtol should be a few times machine epsilon if you expect a "good" computation, and looser if your inputs come from measurements with their own uncertainty.

Subtraction is where precision dies

We have already met catastrophic cancellation in the story chapters. It is worth restating it as a numerical principle:

If two floats $a$ and $b$ agree in their first $k$ leading bits, then $a - b$ has only $\approx 53 - k$ bits of precision left.

If $k$ is large (the numbers are close), the subtraction throws away most of the information.
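
A tiny illustration of the principle, as a sketch. Here $1 + x$ and $1$ agree in roughly their first 50 bits, so only about 3 bits of $x$ survive the subtraction:

```python
x = 1e-15
computed = (1.0 + x) - 1.0          # mathematically equal to x
rel_err = abs(computed - x) / x

print(computed)                     # 1.1102230246251565e-15
print(f"{rel_err:.0%}")             # 11%
```

The subtraction itself is exact; the damage was done when $1 + x$ was rounded to the nearest float, and the subtraction merely exposes it.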

The classic example: a central-difference (midpoint) approximation of the derivative, $f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$.

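A minimal sketch of the experiment, assuming $f(x) = \sin x$ at $x = 1$ as the test function (an arbitrary choice for illustration; the helper name central_diff is also made up here):

```python
import math


def central_diff(f, x, h):
    """Central-difference estimate of f'(x) with step h."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 1.0
exact = math.cos(x0)    # the true derivative of sin at x0

errs = []
for k in range(1, 13):
    h = 10.0 ** -k
    err = abs(central_diff(math.sin, x0, h) - exact)
    errs.append(err)
    print(f"h = 1e-{k:02d}   error = {err:.3e}")
```

For large h the printed error falls by about two decimal orders per row; for very small h it climbs again, because the numerator is a difference of two nearly equal values divided by a tiny number.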

Truncation error shrinks like $h^2$ at first: you can see the error drop by a factor of $100$ for each factor-of-10 reduction in $h$. But around $h = 10^{-6}$, round-off takes over and the error starts to grow. The optimum is somewhere near $h = \sqrt[3]{\epsilon} \approx 6 \times 10^{-6}$.

This is one of the most important pictures in numerical analysis, and we will return to it in Approximation and Error.

Overflow, underflow, and dynamic range

A float can hold magnitudes from roughly $2 \times 10^{-308}$ to $1.8 \times 10^{308}$ before underflow/overflow. Inside that range, arithmetic is reliable. Outside, things go wrong silently.
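
The limits can be read off sys.float_info. A short sketch of what happens at each end of the range:

```python
import sys

print(sys.float_info.max)            # 1.7976931348623157e+308
print(sys.float_info.min)            # 2.2250738585072014e-308 (smallest normal)

print(sys.float_info.max * 2)        # inf: overflow, no exception raised
print(sys.float_info.min / 2 > 0.0)  # True: a subnormal, with reduced precision
print(5e-324 / 2)                    # 0.0: underflow below the smallest subnormal
```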

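The classic trap: computing $\log\left(\sum_i e^{x_i}\right)$, the log-sum-exp that appears all over physics and machine learning. If any $x_i$ is around $1000$, then $e^{x_i}$ overflows to inf (the threshold is $x_i \approx 709.8$). The trick is to subtract the maximum $M = \max_i x_i$ first, using the identity $\log \sum_i e^{x_i} = M + \log \sum_i e^{x_i - M}$, so that no exponent exceeds $0$.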

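A minimal sketch of both versions with NumPy. The function names and the input values 999, 1000, 1001 are assumptions made for this illustration.

```python
import numpy as np


def logsumexp_naive(x):
    """Direct evaluation of log(sum(exp(x))); overflows for large inputs."""
    with np.errstate(over="ignore"):     # silence the overflow warning
        return np.log(np.sum(np.exp(x)))


def logsumexp_stable(x):
    """Subtract the maximum first, so every exponent is at most 0."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


x = np.array([999.0, 1000.0, 1001.0])
print(logsumexp_naive(x))    # inf
print(logsumexp_stable(x))   # about 1001.41
```

This sketch does not handle inputs that are all -inf or that contain +inf. SciPy's scipy.special.logsumexp is a ready-made alternative.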

The naive answer is inf. The stable answer is around $1001.41$. Same equation, very different floating-point behavior.

A short practice problem

Challenge: Stable variance

The textbook formula for variance,

$$\mathrm{Var}(x) = \frac{1}{n} \sum_i x_i^2 \; - \; \left( \frac{1}{n} \sum_i x_i \right)^2$$

is numerically disastrous for data with large mean and small spread, because the two terms become nearly equal and cancel.

Implement stable_variance(xs) that computes the population variance using the two-pass algorithm

$$\bar{x} = \frac{1}{n}\sum_i x_i, \qquad \mathrm{Var}(x) = \frac{1}{n} \sum_i (x_i - \bar{x})^2$$

The naive implementation is provided as naive_variance so you can see what it returns.
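
For reference, a hypothetical stand-in for naive_variance, written from the textbook formula above (the helper provided with the exercise may differ in detail). The test data are made up: four values near $10^9$ whose exact population variance is $22.5$.

```python
def naive_variance(xs):
    """Population variance via the one-pass textbook formula (unstable)."""
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    # Two huge, nearly equal numbers are subtracted here.
    return total_sq / n - mean * mean


xs = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(naive_variance(xs))   # nowhere near the exact value 22.5
```

The squares are of order $10^{18}$, where neighboring floats are 128 apart, so a variance of 22.5 is lost entirely in the rounding.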

Check your understanding

Question (select one)

What does the bit pattern $s=0$, $e=1023$, $m=0$ represent in IEEE 754 double precision?

$0.0$

$1.0$

$1023.0$

$-0.0$

Question (select one)

Why do experienced numerical programmers usually avoid a == b for floating-point comparison?

The == operator is slow

Arithmetic that should produce equal values often differs by a few ULPs (units in the last place), so two values that are "morally" equal compare false

== raises an exception on nan

Python compares floats by string representation
