Working with Distributions
The practical scipy.stats toolkit — the four-questions-to-four-methods map, the loc/scale convention and frozen distributions, fitting a distribution to data with .fit, and sanity-checking the fit with histogram overlays and Q-Q plot intuition.
You've now met the main distributions individually. This page is about
the workflow — the small, reusable set of moves that turns any
distribution in scipy.stats into concrete answers, and the discipline
of fitting a distribution to real data and then checking whether it
actually fits before you trust it. Master this and a new distribution
is never intimidating again: it's just the same four methods with
different parameters.
The mindset is a two-step loop you'll use constantly as a data scientist: model the data (pick a distribution and estimate its parameters), then use the model to answer questions (probabilities, thresholds, simulations). The catch — and the part people skip — is the sanity check in between. A model that doesn't fit gives precise, authoritative, wrong answers.
Four questions, four methods
Almost everything you'll ask of a distribution is one of four questions, and each maps to a method (or a pair) on the frozen distribution object. This table is the whole API in a nutshell:
| You want to know... | Method | Returns |
|---|---|---|
| Density / mass at a value — how concentrated is the distribution here? | .pdf(x) (continuous), .pmf(k) (discrete) | height / point probability |
| Accumulation — what fraction is below (or above) a value? | .cdf(x) = P(X ≤ x); .sf(x) = P(X > x) | a probability |
| Quantile — what value sits at a given percentile? | .ppf(p); .isf(p) (upper tail) | a value |
| Simulation — draw random samples | .rvs(size=...) | random data |
Plus the summaries .mean(), .var(), .std(), and .median(). That's
it — learn these once and every distribution behaves the same way.
cdf and ppf are inverses; sf and isf are their upper-tail twins
cdf (value → left-tail probability) and ppf (probability → value)
undo each other. For the upper tail, sf(x) = 1 - cdf(x) gives
P(X > x), and isf(p) returns the value with upper-tail probability p
— handy for "what threshold leaves only 5% above it?" (isf(0.05)).
Using sf/isf instead of 1 - cdf is also more numerically accurate
deep in the tail.
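A minimal sketch of these relationships, assuming an illustrative normal with loc=30 and scale=6 (the variable names are placeholders chosen for this example):

```python
from scipy import stats

dist = stats.norm(loc=30, scale=6)  # illustrative parameters

# cdf and ppf undo each other
x = 38.0
p = dist.cdf(x)          # value -> left-tail probability
x_back = dist.ppf(p)     # probability -> value, recovers x

# sf and isf are the same pair for the upper tail
threshold = dist.isf(0.05)   # value with 5% of the distribution above it
upper = dist.sf(threshold)   # recovers 0.05

# Deep in the tail, 1 - cdf loses the answer to rounding; sf keeps it
far = 30 + 10 * 6            # ten standard deviations above the mean
naive = 1 - dist.cdf(far)    # cdf rounds to 1.0, so this collapses to 0.0
direct = dist.sf(far)        # tiny but nonzero
print(naive, direct)
```

The last two lines show why sf is the better choice for tail probabilities: the subtraction has nothing left to work with once cdf has rounded to 1.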
One distribution, all four questions
Here's the entire toolkit exercised on a single frozen normal, so you can see how the pieces relate.
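The sketch below is one way to run through the table, assuming the same illustrative parameters (loc=30, scale=6) and a fixed seed so the simulated draws are repeatable:

```python
from scipy import stats

d = stats.norm(loc=30, scale=6)   # frozen normal, illustrative parameters

# 1. Density: how concentrated is the distribution at 30?
density = d.pdf(30)

# 2. Accumulation: what fraction lies below / above 36?
below_36 = d.cdf(36)
above_36 = d.sf(36)

# 3. Quantile: which value sits at the 90th percentile,
#    and which value leaves only 5% above it?
p90 = d.ppf(0.90)
top5 = d.isf(0.05)

# 4. Simulation: draw 1000 random values (seeded for repeatability)
samples = d.rvs(size=1000, random_state=0)

# Summaries
summary = (d.mean(), d.var(), d.std(), d.median())
print(density, below_36, above_36, p90, top5, summary)
```

Since 36 is exactly one standard deviation above the mean here, below_36 is the familiar one-sigma left-tail probability, and below_36 and above_36 sum to 1.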
The loc/scale convention and frozen distributions
Two habits make scipy.stats painless:
The loc/scale convention. Nearly every distribution takes loc
(a shift) and scale (a stretch). For the normal, loc is the mean and
scale is the standard deviation. For the exponential, scale is the
mean (when loc is left at its default of 0). For the uniform, loc is
the left edge and scale is the width.
The meaning changes per distribution, but the interface is identical.
Frozen distributions. Calling stats.norm(loc=30, scale=6) once
returns an object with the parameters baked in. Pass that object around
and call methods on it, instead of repeating loc= and scale= on every
call. It's cleaner and prevents parameter-mismatch bugs.
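To make both habits concrete, here is a short sketch with made-up parameter values; the three distributions and their numbers are illustrative only:

```python
from scipy import stats

# Same interface, different meaning of loc and scale
normal = stats.norm(loc=30, scale=6)      # mean 30, standard deviation 6
expo = stats.expon(scale=4)               # loc defaults to 0, so the mean is 4
unif = stats.uniform(loc=10, scale=5)     # left edge 10, width 5 -> [10, 15]

print(normal.mean(), normal.std())
print(expo.mean())
print(unif.ppf(0.0), unif.ppf(1.0))       # the two edges of the uniform

# Frozen vs. unfrozen: same answer, but the frozen object carries its parameters
p_frozen = normal.cdf(36)
p_unfrozen = stats.norm.cdf(36, loc=30, scale=6)
```

The unfrozen form works, but every call has to repeat loc and scale correctly, which is exactly where mismatches creep in.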
Fitting a distribution to data
So far we've assumed the parameters. With real data you usually
estimate them: hand the data to .fit() and scipy returns the
parameters that best match it (by maximum likelihood). For the normal,
.fit() returns (loc, scale): the sample mean and the standard deviation
computed with divisor n (NumPy's default, ddof=0) rather than n - 1.
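A minimal sketch of the fit-then-freeze pattern. The data here is synthetic, generated only so the example runs on its own; with real data you would skip the first two lines:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for real measurements (an assumption for this example)
rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=8, size=2000)

# Estimate the parameters by maximum likelihood
loc, scale = stats.norm.fit(data)

# Freeze the fitted model so it can answer questions
fitted = stats.norm(loc=loc, scale=scale)

print(loc, data.mean())    # the fitted loc matches the sample mean
print(scale, data.std())   # the fitted scale matches np.std with ddof=0
```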
Fitting bounded distributions: fix loc with floc
Some distributions have a location parameter that doesn't belong in your
problem. For positive-only data like incomes or wait times, you usually
want the distribution anchored at zero, so fix it during the fit:
stats.lognorm.fit(data, floc=0) or stats.expon.fit(data, floc=0).
Without floc=0, scipy may slide loc to a small nonzero value that
fits the sample's minimum but makes the parameters hard to interpret.
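The difference is easy to see on synthetic positive-only data. The sketch below assumes made-up wait times and incomes; the variable names are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
waits = rng.exponential(scale=3.0, size=1000)              # synthetic wait times
incomes = rng.lognormal(mean=10.0, sigma=0.5, size=1000)   # synthetic incomes

# Exponential: free loc vs. loc pinned at zero
free_loc, free_scale = stats.expon.fit(waits)
fixed_loc, fixed_scale = stats.expon.fit(waits, floc=0)
print(free_loc, free_scale)     # loc lands on the sample minimum
print(fixed_loc, fixed_scale)   # loc is 0 and scale is the sample mean

# Lognormal: fit returns (shape, loc, scale); with floc=0 the shape is the
# standard deviation of log(data) and scale is exp(mean of log(data))
shape, loc0, scale0 = stats.lognorm.fit(incomes, floc=0)
print(shape, np.log(scale0))
```

With loc fixed at zero, the remaining parameters map directly onto quantities you can explain: a mean wait time, or the mean and spread of log-income.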
Sanity-checking the fit
Fitting always returns parameters — even for a distribution that's completely wrong for your data. The numbers don't tell you the fit is good; you have to check. Two standard checks:
- Histogram overlay: plot the data's histogram (as a density) and draw the fitted PDF on top. Do they have the same shape — same center, spread, skew, and tails?
- Q–Q plot: plot the data's quantiles against the fitted distribution's quantiles. If the fit is good, the points fall on a straight line. Systematic curving away from the line reveals exactly how the fit fails (heavy tails, skew, etc.).
Histogram overlay
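One way to build the overlay is to compute the density histogram and the fitted PDF on the same grid, then draw both. The sketch below uses synthetic data and a hypothetical helper, draw_overlay, that expects a matplotlib Axes; the numeric part runs without matplotlib:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for real measurements (an assumption for this example)
rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=8, size=2000)

loc, scale = stats.norm.fit(data)
fitted = stats.norm(loc=loc, scale=scale)

# density=True rescales bar heights so the histogram has total area 1,
# which puts it on the same vertical scale as a PDF
heights, edges = np.histogram(data, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
curve = fitted.pdf(centers)

def draw_overlay(ax):
    """Draw the density histogram and the fitted PDF on a matplotlib Axes."""
    ax.bar(centers, heights, width=np.diff(edges), alpha=0.4, label="data")
    ax.plot(centers, curve, label="fitted normal PDF")
    ax.legend()

# Largest vertical gap between the bars and the curve
print(np.max(np.abs(heights - curve)))
```

With matplotlib available, something like `fig, ax = plt.subplots(); draw_overlay(ax); plt.show()` would render the picture. Forgetting density=True is the usual mistake: raw counts dwarf the PDF and the curve looks flat.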
Q–Q plot intuition
A Q–Q (quantile–quantile) plot is the sharper check. The idea:
- Sort the data — these are the sample quantiles.
- For each, compute where the fitted distribution says that quantile should fall — the theoretical quantiles.
- Plot sample-vs-theoretical. A good fit makes the points hug the diagonal line y = x. Curvature, or tails that peel off the line, means the model misses the data's shape there.
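The three steps above can be sketched numerically as follows. The data is synthetic, and the plotting positions (i - 0.5) / n are one common convention among several, chosen here as an assumption:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for real measurements (an assumption for this example)
rng = np.random.default_rng(2)
data = rng.normal(loc=50, scale=8, size=500)

loc, scale = stats.norm.fit(data)
fitted = stats.norm(loc=loc, scale=scale)

# Step 1: sorted data are the sample quantiles
sample_q = np.sort(data)

# Step 2: where the fitted model says each of those quantiles should fall
n = len(sample_q)
probs = (np.arange(1, n + 1) - 0.5) / n
theory_q = fitted.ppf(probs)

# Step 3: plot theory_q (x) against sample_q (y); here we just summarize
# how closely the points track a straight line
r = np.corrcoef(theory_q, sample_q)[0, 1]
print(r)
```

scipy.stats.probplot offers a ready-made version of this computation; the manual form is shown only to expose the steps.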
A high Q-Q correlation is suggestive, not proof
Points hugging the line is evidence the chosen distribution is plausible — not a guarantee the data "follows" it. Real data is never exactly any textbook distribution. Treat the Q–Q plot as a diagnostic for spotting how a model fails (heavy tails, skew), and as a check that the model is good enough for the decision at hand — not as a proof of the true data-generating process.
You're given a sample of measured values in data (a NumPy array). Assume a Normal model.
- Fit a normal with stats.norm.fit(data), which returns (loc, scale). Store them in loc and scale (both float).
- Build the fitted distribution and compute p_above_100, the probability the value exceeds 100, P(X > 100), as a float.
Hints:
- loc, scale = stats.norm.fit(data).
- A frozen fitted distribution: fitted = stats.norm(loc=loc, scale=scale).
- P(X > 100) is fitted.sf(100).
Using the same sample in data, fit a Normal model and use it to set a capacity threshold.
Compute cutoff_95 — the 95th-percentile value of the fitted distribution (the value with 95% of outcomes at or below it), as a float.
Hints:
- Fit with loc, scale = stats.norm.fit(data).
- The 95th percentile is stats.norm(loc=loc, scale=scale).ppf(0.95).
Don't extrapolate a fitted model past your data
A fitted model is only trustworthy in the range where you have data. The tails of a fitted distribution are an extrapolation — they're governed by the assumed shape, not by observations you actually made. If your data tops out around 120, asking the fitted normal for P(X > 300) is answering a question the data never spoke to.
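To see how much the assumed shape matters out there, the sketch below fits two different models to the same synthetic, mildly right-skewed sample and asks both the same questions. The data, the threshold of 70, and the choice of lognormal as the second model are all assumptions for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic, mildly right-skewed data that tops out well below 300
rng = np.random.default_rng(3)
data = rng.lognormal(mean=4.0, sigma=0.25, size=1000)

# Two candidate models fitted to the same data
n_loc, n_scale = stats.norm.fit(data)
normal_fit = stats.norm(loc=n_loc, scale=n_scale)

s, _, ln_scale = stats.lognorm.fit(data, floc=0)
lognorm_fit = stats.lognorm(s, loc=0, scale=ln_scale)

# Inside the observed range the two models roughly agree
inside_normal = normal_fit.sf(70)
inside_lognorm = lognorm_fit.sf(70)
print(inside_normal, inside_lognorm, np.mean(data > 70))

# Far beyond the largest observation they disagree by many orders of magnitude
tail_normal = normal_fit.sf(300)
tail_lognorm = lognorm_fit.sf(300)
print(data.max(), tail_normal, tail_lognorm)
```

Neither tail number is backed by an observation. The gap between them comes entirely from the shape each model assumes, which is the point of the warning.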
Two misconceptions to retire
(1) Fitting proves the data follows the distribution. It doesn't.
.fit() returns parameters for any distribution you ask, fit or not —
you must verify with an overlay or Q–Q plot, and even a good fit only
means "plausible and good enough," never "this is the true law."
(2) A fitted model is valid everywhere. It's only credible within the
observed range. Extreme-tail probabilities are extrapolations driven by
the assumed shape; for genuine tail risk, that shape assumption is doing
all the work, so choose it carefully (heavy-tailed data needs a
heavy-tailed model).
Check your understanding
You want the value with only 5% of the distribution above it (the upper-tail threshold). Which single method gives it most directly?
.cdf(0.05)
.ppf(0.05)
.isf(0.05)
.sf(0.05)
You run stats.norm.fit(data) on a strongly right-skewed dataset and get back (loc, scale). What can you conclude?
The data follows a normal distribution, since the fit succeeded
The fit failed because the data is skewed
You have the best-fitting normal parameters, but you must still check (overlay/Q-Q) whether a normal is appropriate — and for skewed data it likely isn't
The skew will be automatically corrected by the fit
In a Q–Q plot comparing your data to a fitted distribution, the points follow the diagonal line in the middle but curve upward sharply at the right end. What does this indicate?
The fit is perfect
The data has a heavier right tail than the fitted distribution — extreme high values occur more often than the model predicts
The data has fewer values than the model assumes
The loc parameter was estimated incorrectly
You fit a normal to response times that range from 50 to 130 ms, then use the model to report P(X > 400). Why should you be skeptical of that number?
Because sf is inaccurate for large values
Because the normal can't produce values above its mean
Because 400 is far outside the observed data range, so that probability is an extrapolation governed entirely by the assumed normal shape, not by any observations
Because fitted models can only answer questions about the mean
Key takeaways
- Every distribution answers the same four questions: density (pdf/pmf), accumulation (cdf/sf), quantile (ppf/isf), and simulation (rvs), plus .mean/.var/.std.
- The loc/scale convention and frozen distributions give one consistent, tidy interface across all of scipy.stats.
- .fit() estimates parameters from data; for bounded data, fix the location with floc=0.
- Always sanity-check the fit with a histogram overlay and a Q–Q plot: points on a line means a plausible fit, curvature shows how it fails.
- Fitting does not prove the data follows the distribution, and a fitted model is only trustworthy within the observed range — its far tails are extrapolation.
- This model-then-question loop underpins the inference ahead: we'll lean on these methods in The Normal Distribution recap and throughout Confidence Intervals and the testing chapters.