When Mathematics Meets Computation

How floating-point arithmetic, finite precision, and discrete approximation change the way we have to think about mathematics

We close the story half of this course with a meditation. The previous three pages were history; this page is philosophy.

When a mathematician writes $\sin(x)$ , they mean an exact, infinitely precise function defined for every real number. When a Python program writes math.sin(x), it means an approximation of that function, evaluated at a finite number of bits, with a bounded error. Those two sins do not behave the same way, and the gap between them is where computational science lives.

This page does not introduce many new tools. It introduces a mindset — the conviction that every number on a computer is slightly wrong, and that doing science well means making the wrongness controllable.

Continuous mathematics, discrete machines

Three lossy steps separate mathematics from a number you can print.

Discretization. A continuous quantity — time, space, a wave — is replaced by samples on a finite grid.
Floating-point arithmetic. Each sample is stored in at most 64 bits, so almost no real number is represented exactly.
Algorithmic approximation. An infinite series, a derivative, an integral is replaced by a finite procedure that converges "close enough."

Every chapter in the rest of this course is about understanding the error introduced by one of these three steps and learning to control it.

How wrong is a floating-point number?

A 64-bit IEEE 754 double can represent about $1.8 \times 10^{308}$ distinct positive values, but it cannot represent any of them exactly between most pairs. The spacing between representable numbers near $1$ is about $\epsilon \approx 2.22 \times 10^{-16}$ — the famous machine epsilon. You will see this number again and again.

In ordinary mathematics, $(x+1) - x$ always equals $1$ . In floating-point, it can equal $0$ . The associative property of addition does not hold for floating-point numbers. This single fact has motivated entire research programs into numerical algorithm design.

The contract you sign with finite precision

When you start computing in floating point, you sign an implicit contract:

Every elementary operation $\circ$ produces the correctly rounded result: the true real-number answer, projected to the nearest representable float. The relative error of any single operation is at most $\epsilon/2$ .

This sounds reassuring — and for a single operation it is. The trouble is that long calculations chain millions of operations together, and the errors interact.

Some interactions are benign:

Bounded growth. If errors of size $\epsilon$ stay of size $O(\epsilon)$ after $N$ steps, we call the algorithm stable.

Some interactions are catastrophic:

Cancellation. Subtracting two nearly equal numbers can erase most of their significant digits in a single step. We saw this in the very first chapter.
Amplification. Some operations multiply error by a large factor (the condition number). A small input error becomes a huge output error.

Here is a single number that captures how good your "data" needs to be for a problem (not an algorithm — a problem) to be solvable at all. Multiplying a matrix by a vector might be well-conditioned; inverting the same matrix might be wildly ill-conditioned.

A condition number of $10^{18}$ means a relative input error of $\epsilon \approx 10^{-16}$ becomes a relative output error of $\epsilon \cdot 10^{18} = 100$ — utterly useless. No clever algorithm can fix this; the problem itself is broken at 64-bit precision.

Approximation: a finite version of an infinite idea

Almost every "function" in math or numpy is an approximation. math.sin(x)? A polynomial of about degree 12. math.exp(x)? Argument reduction plus a small Taylor expansion. math.log(x)? A rational approximation of carefully chosen degree.

You can build your own from scratch. Here is a Taylor approximation of $\sin$ around zero — a sequence of polynomials that converge to $\sin$ as you add more terms.

Two lessons fall out of this picture:

Convergence is local. The Taylor series for sin converges everywhere, but very slowly far from zero. Real libraries use argument reduction first (reducing $x$ to $[-\pi/4, \pi/4]$ ) and then a low-degree polynomial.
More terms is not always better. Adding terms reduces truncation error but each additional addition contributes round-off error. There is always an optimum, and finding it is part of numerical analysis.

Reproducibility: same code, same numbers

Reproducibility in science used to mean "another scientist can follow your steps and get the same answer." In computational science it has a sharper meaning: the bits should match. The same code, run with the same library versions on the same hardware, should produce the same floating-point output.

This is harder than it sounds. Three things can break bit-for-bit reproducibility:

BLAS implementations differ. OpenBLAS, MKL, Apple's Accelerate all implement matrix multiplication slightly differently and can produce different rounding in the last bit.
Threading order changes results. A parallel sum that adds floats in a non-deterministic order will round differently each run.
Random number generators must be seeded. np.random without a seed produces different sequences every run.

The minimum hygiene is to always seed your RNG and to always pin your library versions. Here is the canonical setup we will use throughout the course.

The two prints will always agree, on any machine, today and a decade from now (as long as NumPy keeps its RNG contract — which it has guaranteed for default_rng since NumPy 1.17).

The role of computational science in the modern world

Every node on that map runs on the same numerical primitives this course teaches: floating-point arithmetic, array operations, matrix factorizations, ODE integration, optimization, Monte Carlo sampling, and reproducible workflows. The application domains look very different from the outside; from the inside they share an intellectual core that has barely changed in fifty years.

That core is what the rest of the course is about. From the next page on, the story stops and the practice begins.

Check your understanding

QuestionSelect one

Which of these is the most accurate characterization of how floating-point arithmetic differs from ordinary real-number arithmetic?

Floating-point arithmetic is slower

Floating-point operations are individually correctly rounded, but compound operations are not associative and can suffer catastrophic cancellation, so algebraically equivalent expressions can produce different numerical answers

Floating-point can only represent integers

Floating-point uses base 10 internally

QuestionSelect one

The condition number $\kappa(A)$ of a matrix tells you something fundamental about a problem. Which statement best captures what it tells you?

It measures how fast a solver will run

It measures how much a relative input error gets amplified into a relative output error: a problem with $\kappa = 10^k$ can lose up to $k$ digits of precision regardless of algorithm

It measures memory usage

It measures the symmetry of $A$