Working with Distributions

The practical scipy.stats toolkit — the four-questions-to-four-methods map, the loc/scale convention and frozen distributions, fitting a distribution to data with .fit, and sanity-checking the fit with histogram overlays and Q-Q plot intuition.

You've now met the main distributions individually. This page is about the workflow — the small, reusable set of moves that turns any distribution in scipy.stats into concrete answers, and the discipline of fitting a distribution to real data and then checking whether it actually fits before you trust it. Master this and a new distribution is never intimidating again: it's just the same four methods with different parameters.

The mindset is a two-step loop you'll use constantly as a data scientist: model the data (pick a distribution and estimate its parameters), then use the model to answer questions (probabilities, thresholds, simulations). The catch — and the part people skip — is the sanity check in between. A model that doesn't fit gives precise, authoritative, wrong answers.

Four questions, four methods

Almost everything you'll ask of a distribution is one of four questions, and each maps to a method (or a pair) on the frozen distribution object. This table is the whole API in a nutshell:

You want to know...	Method	Returns
Density / mass at a value — how concentrated is the distribution here?	`.pdf(x)` (continuous), `.pmf(k)` (discrete)	height / point probability
Accumulation — what fraction is below (or above) a value?	`.cdf(x)` = P(X ≤ x); `.sf(x)` = P(X > x)	a probability
Quantile — what value sits at a given percentile?	`.ppf(p)`; `.isf(p)` (upper tail)	a value
Simulation — draw random samples	`.rvs(size=...)`	random data

Plus the summaries .mean(), .var(), .std(), and .median(). That's it — learn these once and every distribution behaves the same way.
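To see that discrete distributions follow the same pattern, here is a minimal sketch using a Poisson distribution. The rate of 4 events per interval and the variable names are illustrative assumptions, not values from a dataset.

```python
from scipy import stats

# Hypothetical model: on average 4 events per interval.
calls = stats.poisson(mu=4)

p_exactly_2 = calls.pmf(2)     # mass: P(X = 2)
p_at_most_2 = calls.cdf(2)     # accumulation: P(X <= 2)
p_more_than_2 = calls.sf(2)    # upper tail: P(X > 2)
k_90 = calls.ppf(0.90)         # quantile: smallest k with P(X <= k) >= 0.90
draws = calls.rvs(size=5, random_state=0)  # simulation, seeded for repeatability

print(p_exactly_2, p_at_most_2, p_more_than_2, k_90, draws)
print(calls.mean(), calls.var(), calls.std(), calls.median())
```

The only change from a continuous distribution is `.pmf` in place of `.pdf`; every other call keeps its name and meaning.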

cdf and ppf are inverses; sf and isf are their upper-tail twins

cdf (value → left-tail probability) and ppf (probability → value) undo each other. For the upper tail, sf(x) = 1 - cdf(x) gives P(X > x), and isf(p) returns the value with upper-tail probability p — handy for "what threshold leaves only 5% above it?" (isf(0.05)). Using sf/isf instead of 1 - cdf is also more numerically accurate deep in the tail.
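A short sketch of those relationships on a standard normal; the test value 1.3 and the far-tail point 10 are arbitrary choices for illustration.

```python
from scipy import stats

z = stats.norm()  # standard normal: loc=0, scale=1

x = 1.3
p = z.cdf(x)         # value -> left-tail probability
x_back = z.ppf(p)    # probability -> value, recovers x up to rounding

threshold = z.isf(0.05)       # value with 5% of the distribution above it
same_threshold = z.ppf(0.95)  # the same value via the lower tail

# Deep in the tail, cdf(10) rounds to exactly 1.0, so the subtraction loses everything.
naive_tail = 1 - z.cdf(10)    # 0.0
direct_tail = z.sf(10)        # about 7.6e-24, computed without the subtraction

print(x_back, threshold, same_threshold, naive_tail, direct_tail)
```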

One distribution, all four questions

Here's the entire toolkit exercised on a single frozen normal, so you can see how the pieces relate.
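A minimal sketch of that walkthrough, assuming a hypothetical commute-time model with mean 30 and standard deviation 6. The numbers and variable names are illustrative only.

```python
from scipy import stats

# Frozen normal: parameters are set once and reused by every method call.
commute = stats.norm(loc=30, scale=6)

density_at_30 = commute.pdf(30)   # density: height of the curve at the mean
p_under_24 = commute.cdf(24)      # accumulation: P(X <= 24), one sd below the mean
p_over_42 = commute.sf(42)        # upper tail: P(X > 42), two sd above the mean
p90 = commute.ppf(0.90)           # quantile: value with 90% at or below it
top5 = commute.isf(0.05)          # quantile: value with 5% above it
sample = commute.rvs(size=1000, random_state=42)  # simulation, seeded

print(density_at_30, p_under_24, p_over_42, p90, top5)
print(commute.mean(), commute.std(), commute.median())
```

Note how the pairs connect: `commute.cdf(p90)` returns 0.90 and `commute.sf(top5)` returns 0.05, up to floating-point rounding.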

The loc/scale convention and frozen distributions

Two habits make scipy.stats painless:

The loc/scale convention. Nearly every distribution takes loc (a shift) and scale (a stretch). For the normal, loc is the mean and scale is the standard deviation. For the exponential, scale is the mean as long as loc stays at its default of 0 (in general the mean is loc + scale). For the uniform, loc is the left edge and scale is the width, so the distribution covers [loc, loc + scale]. The meaning changes per distribution, but the interface is identical.

Frozen distributions. Calling stats.norm(loc=30, scale=6) once returns an object with the parameters baked in. Pass that object around and call methods on it, instead of repeating loc= and scale= on every call. It's cleaner and prevents parameter-mismatch bugs.
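Both habits can be sketched together as follows. The normal uses the parameters from the text; the exponential and uniform parameters are made-up values for illustration.

```python
from scipy import stats

normal = stats.norm(loc=30, scale=6)     # mean 30, standard deviation 6
wait = stats.expon(scale=5)              # loc defaults to 0, so the mean is 5
window = stats.uniform(loc=2, scale=3)   # left edge 2, width 3: uniform on [2, 5]

# Unfrozen call: parameters must be repeated (and kept consistent) every time.
p_unfrozen = stats.norm.cdf(36, loc=30, scale=6)

# Frozen call: parameters travel with the object.
p_frozen = normal.cdf(36)

print(wait.mean(), window.mean(), p_unfrozen, p_frozen)
```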

Fitting a distribution to data

So far we've assumed the parameters. With real data you usually estimate them: hand the data to .fit() and scipy returns the parameters that best match it (by maximum likelihood). For the normal, .fit() returns (loc, scale): the sample mean and the maximum-likelihood standard deviation, which divides by n rather than n - 1 (the same as NumPy's default ddof=0).
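A minimal sketch of a normal fit. The data here is simulated as a stand-in for real measurements, so the seed, sample size, and true parameters are assumptions.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for real measurements.
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=500)

# Maximum-likelihood estimates of the normal's parameters.
loc, scale = stats.norm.fit(data)

# Freeze the fitted model so it can answer questions.
fitted = stats.norm(loc=loc, scale=scale)

print(loc, data.mean())   # loc matches the sample mean
print(scale, data.std())  # scale matches np.std with its default ddof=0
```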

Fitting bounded distributions: fix loc with floc

Some distributions have a location parameter that doesn't belong in your problem. For positive-only data like incomes or wait times, you usually want the distribution anchored at zero, so fix it during the fit: stats.lognorm.fit(data, floc=0) or stats.expon.fit(data, floc=0). Without floc=0, scipy may slide loc to a small nonzero value that fits the sample's minimum but makes the parameters hard to interpret.
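The effect of floc=0 can be sketched on simulated positive-only data. Both samples below are generated for illustration; the parameter values are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
waits = rng.exponential(scale=4.0, size=400)            # simulated wait times
amounts = rng.lognormal(mean=1.0, sigma=0.5, size=400)  # simulated skewed amounts

# Free fit: loc is estimated too and typically lands at or near the sample minimum.
free_loc, free_scale = stats.expon.fit(waits)

# Anchored fit: loc is pinned at 0, so scale is simply the estimated mean.
fixed_loc, fixed_scale = stats.expon.fit(waits, floc=0)

# lognorm has a shape parameter s in addition to loc and scale.
s, ln_loc, ln_scale = stats.lognorm.fit(amounts, floc=0)

print(free_loc, free_scale)
print(fixed_loc, fixed_scale, waits.mean())
print(s, ln_loc, ln_scale)
```

With loc fixed at 0, the exponential's scale reads directly as the mean wait, and the lognormal's s and scale correspond to the standard deviation of log(data) and the exponential of its mean.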

Sanity-checking the fit

Fitting always returns parameters — even for a distribution that's completely wrong for your data. The numbers don't tell you the fit is good; you have to check. Two standard checks:

Histogram overlay: plot the data's histogram (as a density) and draw the fitted PDF on top. Do they have the same shape — same center, spread, skew, and tails?
Q–Q plot: plot the data's quantiles against the fitted distribution's quantiles. If the fit is good, the points fall on a straight line. Systematic curving away from the line reveals exactly how the fit fails (heavy tails, skew, etc.).

Histogram overlay
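One way to build the overlay, sketched on simulated data. The helper name plot_overlay is hypothetical, and the plotting part assumes matplotlib is installed; the numeric comparison above it needs only NumPy and scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=100, scale=15, size=1000)  # simulated stand-in for real data

loc, scale = stats.norm.fit(data)
fitted = stats.norm(loc=loc, scale=scale)

# density=True rescales bar heights so the histogram integrates to 1,
# which puts it on the same vertical scale as the PDF.
heights, edges = np.histogram(data, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
model_heights = fitted.pdf(centers)
max_gap = np.max(np.abs(heights - model_heights))  # rough numeric summary


def plot_overlay(data, fitted, bins=30):
    """Draw the histogram with the fitted PDF on top (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    fig, ax = plt.subplots()
    ax.hist(data, bins=bins, density=True, alpha=0.5, label="data")
    grid = np.linspace(data.min(), data.max(), 200)
    ax.plot(grid, fitted.pdf(grid), label="fitted PDF")
    ax.set_xlabel("value")
    ax.set_ylabel("density")
    ax.legend()
    return fig
```

Calling plot_overlay(data, fitted) returns a figure to inspect by eye. The key detail is density=True: without it the histogram shows counts and the PDF looks flat against it.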

Q–Q plot intuition

A Q–Q (quantile–quantile) plot is the sharper check. The idea:

Sort the data — these are the sample quantiles.
For each, compute where the fitted distribution says that quantile should fall — the theoretical quantiles.
Plot sample-vs-theoretical. A good fit makes the points hug the diagonal line y = x. Curvature, or tails that peel off the line, means the model misses the data's shape there.
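The three steps above can be sketched numerically as follows, on simulated data. The plotting positions (i - 0.5)/n are one common convention, chosen here as an assumption; scipy's own stats.probplot helper uses a slightly different one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=100, scale=15, size=300)  # simulated stand-in for real data

loc, scale = stats.norm.fit(data)
fitted = stats.norm(loc=loc, scale=scale)

# Step 1: sorted data are the sample quantiles.
sample_q = np.sort(data)

# Step 2: where the fitted model says each of those quantiles should fall.
n = len(sample_q)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_q = fitted.ppf(probs)

# Step 3: plotting sample_q against theoretical_q gives the Q-Q plot;
# their correlation is a one-number summary of how straight it is.
r = np.corrcoef(theoretical_q, sample_q)[0, 1]

# scipy's helper returns standard-normal quantiles, the sorted data,
# and a least-squares line through the points.
(osm, osr), (slope, intercept, r_builtin) = stats.probplot(data, dist="norm")

print(r, r_builtin, slope, intercept)
```

Because probplot compares against the standard normal, its slope and intercept come out close to the fitted scale and loc.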

A high Q–Q correlation is suggestive, not proof

Points hugging the line are evidence that the chosen distribution is plausible, not a guarantee that the data "follows" it. Real data is never exactly any textbook distribution. Treat the Q–Q plot as a diagnostic for spotting how a model fails (heavy tails, skew) and as a check that the model is good enough for the decision at hand, not as proof of the true data-generating process.

You're given a sample of measured values in data (a NumPy array). Assume a Normal model.

Fit a normal with stats.norm.fit(data), which returns (loc, scale). Store them in loc and scale (both float).
Build the fitted distribution and compute p_above_100 — the probability the value exceeds 100, P(X > 100), as a float.

Hints:

loc, scale = stats.norm.fit(data).
A frozen fitted distribution: fitted = stats.norm(loc=loc, scale=scale).
P(X > 100) is fitted.sf(100).
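Put together, the hints could look like the sketch below. The exercise supplies its own data array; the simulated one here is only a placeholder so the snippet runs on its own.

```python
import numpy as np
from scipy import stats

# Placeholder for the exercise's data (assumption: any 1-D NumPy array works).
rng = np.random.default_rng(4)
data = rng.normal(loc=90, scale=12, size=250)

# Fit the normal and convert the estimates to plain Python floats.
loc, scale = stats.norm.fit(data)
loc, scale = float(loc), float(scale)

# Freeze the fitted model, then ask for the upper-tail probability.
fitted = stats.norm(loc=loc, scale=scale)
p_above_100 = float(fitted.sf(100))

print(loc, scale, p_above_100)
```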

Using the same sample in data, fit a Normal model and use it to set a capacity threshold.

Compute cutoff_95 — the 95th-percentile value of the fitted distribution (the value with 95% of outcomes at or below it), as a float.

Hints:

Fit with loc, scale = stats.norm.fit(data).
The 95th percentile is stats.norm(loc=loc, scale=scale).ppf(0.95).
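A sketch of the threshold calculation with two quick self-checks. As before, the simulated array is a placeholder for the exercise's data.

```python
import numpy as np
from scipy import stats

# Placeholder for the exercise's data.
rng = np.random.default_rng(4)
data = rng.normal(loc=90, scale=12, size=250)

loc, scale = stats.norm.fit(data)
fitted = stats.norm(loc=loc, scale=scale)

# 95th percentile of the fitted model: 95% of outcomes at or below this value.
cutoff_95 = float(fitted.ppf(0.95))

# Check 1: feeding the cutoff back through cdf should return 0.95.
model_share_below = fitted.cdf(cutoff_95)

# Check 2: the share of observed values at or below the cutoff should be near 0.95
# if the normal model is reasonable for this data.
observed_share_below = np.mean(data <= cutoff_95)

print(cutoff_95, model_share_below, observed_share_below)
```

If the observed share were far from 0.95, that would be a hint that the normal model misses the upper tail.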

Don't extrapolate a fitted model past your data

A fitted model is only trustworthy in the range where you have data. The tails of a fitted distribution are an extrapolation — they're governed by the assumed shape, not by observations you actually made. If your data tops out around 120, asking the fitted normal for P(X > 300) is answering a question the data never spoke to.
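To make the point concrete, the sketch below fits two different shapes to the same simulated right-skewed sample and compares their answers inside and far outside the observed range. The data, the threshold of 80, and the choice of twice the sample maximum are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.lognormal(mean=np.log(60), sigma=0.5, size=2000)  # simulated skewed data

# Two candidate models fitted to the same sample.
normal_fit = stats.norm(*stats.norm.fit(data))
s, loc, scale = stats.lognorm.fit(data, floc=0)
lognorm_fit = stats.lognorm(s, loc=loc, scale=scale)

inside = 80.0                 # well within the observed range
far_out = 2.0 * data.max()    # beyond anything that was observed

p_norm_inside = normal_fit.sf(inside)
p_lognorm_inside = lognorm_fit.sf(inside)
p_norm_far = normal_fit.sf(far_out)
p_lognorm_far = lognorm_fit.sf(far_out)

print("inside:", p_norm_inside, p_lognorm_inside)
print("far out:", p_norm_far, p_lognorm_far)
```

Inside the data the two models give answers of the same order of magnitude. Far outside it they differ by many orders of magnitude, and no observation is available to say which is closer.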

Two misconceptions to retire

(1) Fitting proves the data follows the distribution. It doesn't. .fit() returns parameters for any distribution you ask, fit or not — you must verify with an overlay or Q–Q plot, and even a good fit only means "plausible and good enough," never "this is the true law." (2) A fitted model is valid everywhere. It's only credible within the observed range. Extreme-tail probabilities are extrapolations driven by the assumed shape; for genuine tail risk, that shape assumption is doing all the work, so choose it carefully (heavy-tailed data needs a heavy-tailed model).

Check your understanding

Question (select one)

You want the value with only 5% of the distribution above it (the upper-tail threshold). Which single method gives it most directly?

.cdf(0.05)

.ppf(0.05)

.isf(0.05)

.sf(0.05)

Question (select one)

You run stats.norm.fit(data) on a strongly right-skewed dataset and get back (loc, scale). What can you conclude?

The data follows a normal distribution, since the fit succeeded

The fit failed because the data is skewed

You have the best-fitting normal parameters, but you must still check (overlay/Q-Q) whether a normal is appropriate — and for skewed data it likely isn't

The skew will be automatically corrected by the fit

Question (select one)

In a Q–Q plot comparing your data to a fitted distribution, the points follow the diagonal line in the middle but curve upward sharply at the right end. What does this indicate?

The fit is perfect

The data has a heavier right tail than the fitted distribution — extreme high values occur more often than the model predicts

The data has fewer values than the model assumes

The loc parameter was estimated incorrectly

Question (select one)

You fit a normal to response times that range from 50 to 130 ms, then use the model to report P(X > 400). Why should you be skeptical of that number?

Because sf is inaccurate for large values

Because the normal can't produce values above its mean

Because 400 is far outside the observed data range, so that probability is an extrapolation governed entirely by the assumed normal shape, not by any observations

Because fitted models can only answer questions about the mean

Key takeaways

Every distribution answers the same four questions: density (pdf/pmf), accumulation (cdf/sf), quantile (ppf/isf), and simulation (rvs) — plus .mean/.var/.std.
The loc/scale convention and frozen distributions give one consistent, tidy interface across all of scipy.stats.
.fit() estimates parameters from data; for bounded data, fix the location with floc=0.
Always sanity-check the fit with a histogram overlay and a Q–Q plot — points on a line means a plausible fit, curvature shows how it fails.
Fitting does not prove the data follows the distribution, and a fitted model is only trustworthy within the observed range — its far tails are extrapolation.
This model-then-question loop underpins the inference ahead: we'll lean on these methods in The Normal Distribution recap and throughout Confidence Intervals and the testing chapters.
