Capstone: Building and Evaluating an End-to-End Statistical Forecasting Pipeline
One complete workflow that threads the whole course together — explore, stabilize variance, difference to stationarity, identify orders, split chronologically, fit a model, forecast with uncertainty, and prove skill against a baseline with honest out-of-sample metrics.
This is where it all comes together. Across the course you collected a toolkit — datetime indexing, resampling, gap-filling, decomposition, stationarity testing, differencing, ACF/PACF reading, ARIMA, and honest validation. Here we run the entire pipeline, start to finish, on the airline series, and end with the only question that matters: does our model actually beat a trivial baseline on data it has never seen?
Step 1 — Load and look
Every forecast starts with eyes on the data. We name what we see before touching a model.
Step 2 — Confirm non-stationarity, then plan the transforms
We don't guess; we test. The raw series should fail the ADF test (a p-value above 0.05, so the unit-root null cannot be rejected), and the fix-up sequence (log for variance, a regular difference for trend, a seasonal difference for the yearly cycle) should drive it to stationarity.
Step 3 — Decompose to understand the structure
A multiplicative decomposition confirms what the eye saw and gives us a clean picture of each component — useful for sanity-checking and for explaining the forecast to stakeholders.
Step 4 — Identify orders from the stationary series
With the series made stationary, the ACF/PACF become readable. We're looking for the regular short-lag structure (for p, q) and any leftover spike at the seasonal lag 12 (which tells us we need seasonal terms).
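To make the "leftover spike at lag 12" check concrete, here is a minimal sketch that computes the sample autocorrelation by hand and flags lags outside the approximate 95% band of ±1.96/√n. The `sample_acf` and `make_airline_like` helpers are hypothetical, the data is a synthetic stand-in, and the PACF is left out to keep the arithmetic visible; in practice statsmodels provides `acf` and `pacf`.

```python
import numpy as np
import pandas as pd


def make_airline_like(n_months=144, seed=0):
    """Synthetic stand-in, NOT the real airline data."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_months)
    log_level = 4.8 + 0.01 * t + 0.2 * np.sin(2 * np.pi * t / 12)
    log_level = log_level + rng.normal(0.0, 0.03, n_months)
    index = pd.date_range("1949-01-01", periods=n_months, freq="MS")
    return pd.Series(np.exp(log_level), index=index, name="passengers")


def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag (divisor: full sum of squares)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


air = make_airline_like()

# Same recipe as the stationarity step: log, regular difference, seasonal difference.
stationary = np.log(air).diff().diff(12).dropna()

acf_values = sample_acf(stationary, max_lag=24)
band = 1.96 / np.sqrt(len(stationary))  # approximate 95% band under white noise

significant_lags = [k for k in range(1, 25) if abs(acf_values[k]) > band]
print(f"band = +/-{band:.3f}")
print("lags outside the band:", significant_lags)
print(f"ACF at lag 1 = {acf_values[1]:.3f}, at lag 12 = {acf_values[12]:.3f}")
```

If lag 12 shows up in `significant_lags`, that is the cue for seasonal terms.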
Steps 5-7 — Split, fit, and forecast (honestly)
Now the part everything else was for. We hold out the last 24 months, fit on the past only, and forecast that held-out future. We fit two models — a seasonal ARIMA (the extra terms at lag 12 we just justified) and a plain ARIMA (no seasonal terms) — and we keep a seasonal-naive baseline. Then we score all three on the held-out months with honest out-of-sample metrics.
What 'seasonal ARIMA' adds
A seasonal ARIMA is just ARIMA with the same AR/I/MA machinery repeated at the seasonal lag. order=(1,1,1) handles the month-to-month dynamics; seasonal_order=(1,1,1,12) handles the year-to-year dynamics, including the seasonal difference we found we needed. It's the natural home for the lag-12 structure plain ARIMA left in its residuals.
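Written out in backshift notation, the model with order=(1,1,1) and seasonal_order=(1,1,1,12) can be sketched as below. The symbols are assumptions introduced here, not taken from the course: $y_t$ is the logged series, $B$ is the backshift operator ($B y_t = y_{t-1}$), $\varepsilon_t$ is white noise, and the plus signs on the MA side follow one common convention (some textbooks use minus).

```latex
\[
(1 - \phi_1 B)\,(1 - \Phi_1 B^{12})\,(1 - B)\,(1 - B^{12})\, y_t
  = (1 + \theta_1 B)\,(1 + \Theta_1 B^{12})\, \varepsilon_t
\]
```

The factors $(1 - B)$ and $(1 - B^{12})$ are the regular and seasonal differences; $\phi_1, \theta_1$ act at lag 1 and $\Phi_1, \Theta_1$ act at lag 12.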
The seasonal model posts the lowest error on data it never saw — and it beats the baseline, which is the bar every model must clear. Plain ARIMA captures the trend but, lacking seasonal terms, leaves the yearly cycle in its errors, so it trails the seasonal model. The gap between them is precisely the seasonality that plain ARIMA's lag-12 residual spike warned about on the ARIMA page — the residual diagnostic, paid off in hard numbers.
Step 8 — Visualize the forecast against reality
What 'done' looks like
A finished forecast is not just a line into the future. It is: a model whose assumptions you checked (stationarity), whose orders you justified (ACF/PACF), evaluated on data it never saw (chronological split), that beats a baseline on honest metrics (MAE/RMSE/MAPE), with uncertainty bands that the actuals fall inside. Anything less is a guess wearing a lab coat.
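The last two items of that checklist reduce to a few lines of arithmetic. The sketch below uses a hypothetical `forecast_report` helper and made-up numbers chosen so the results are easy to verify by hand.

```python
import numpy as np


def forecast_report(actual, predicted, lower, upper):
    """MAE, RMSE, MAPE (in percent), and the share of actuals inside the band."""
    actual, predicted, lower, upper = (
        np.asarray(a, dtype=float) for a in (actual, predicted, lower, upper)
    )
    error = actual - predicted
    return {
        "mae": float(np.mean(np.abs(error))),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        "mape": float(np.mean(np.abs(error / actual)) * 100),  # needs nonzero actuals
        "coverage": float(np.mean((actual >= lower) & (actual <= upper))),
    }


# Illustrative values only, not course results.
report = forecast_report(
    actual=[100, 200, 400],
    predicted=[110, 190, 400],
    lower=[90, 205, 350],
    upper=[120, 210, 450],
)
print(report)
# mae = 20/3 (about 6.67), rmse = sqrt(200/3) (about 8.16),
# mape = 5.0, coverage = 2/3 (the second actual, 200, sits below its lower bound 205)
```

For a 95% band, coverage on a reasonably long test set should land near 0.95; much lower suggests the bands are too narrow.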
Step 9 — Real life: it doesn't end at the notebook
In production, a forecast is a living thing:
- Retrain on a schedule. As new months arrive, refit so the model learns the latest dynamics. A model trained once and frozen slowly goes stale.
- Monitor the error. Track the live forecast error over time; if it drifts up, the world has changed (a regime shift) and the model needs attention.
- Keep the baseline running. If the fancy model ever stops beating seasonal-naive in production, fall back to the baseline — it's cheaper and, in that moment, better.
- Respect the horizon. Short-horizon forecasts are trustworthy; far-out ones come with wide bands for a reason. Don't promise precision the uncertainty doesn't support.
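The monitoring and fallback bullets above could be wired together roughly as follows. `ForecastMonitor` is a hypothetical helper, the window length is an arbitrary choice, and the numbers are invented to simulate a regime shift.

```python
from collections import deque


class ForecastMonitor:
    """Hypothetical helper: rolling absolute errors for the live model and the baseline."""

    def __init__(self, window=6):
        self.model_errors = deque(maxlen=window)
        self.baseline_errors = deque(maxlen=window)

    def record(self, actual, model_forecast, baseline_forecast):
        self.model_errors.append(abs(actual - model_forecast))
        self.baseline_errors.append(abs(actual - baseline_forecast))

    def rolling_mae(self):
        n = len(self.model_errors)
        return sum(self.model_errors) / n, sum(self.baseline_errors) / n

    def should_fall_back(self):
        # Wait for a full window so one bad month does not trigger a switch.
        if len(self.model_errors) < self.model_errors.maxlen:
            return False
        model_mae, baseline_mae = self.rolling_mae()
        return model_mae > baseline_mae


monitor = ForecastMonitor(window=3)

# Calm period: (actual, model forecast, seasonal-naive forecast); the model is closer.
for actual, model_fc, baseline_fc in [(100, 98, 90), (110, 108, 100), (120, 119, 111)]:
    monitor.record(actual, model_fc, baseline_fc)
before_shift = monitor.should_fall_back()  # False: model errors 2, 2, 1 vs 10, 10, 9

# Regime shift: the model misses badly while the baseline tracks the new level.
for actual, model_fc, baseline_fc in [(200, 130, 190), (210, 140, 200)]:
    monitor.record(actual, model_fc, baseline_fc)
after_shift = monitor.should_fall_back()  # True: window is now 1, 70, 70 vs 9, 10, 10

print(before_shift, after_shift)
```

A rising rolling MAE is also the natural trigger for the scheduled refit: retrain first, and fall back only if the refit model still trails the baseline.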
Practice
Reproduce the diagnosis stage on the loaded air series. Build a dict report recording the ADF p-value at each stage of the standard airline recipe:
- "raw": ADF p-value of air
- "log": ADF p-value of np.log(air)
- "log_diff": ADF p-value of np.log(air).diff()
- "log_diff_sdiff": ADF p-value of np.log(air).diff().diff(12)
Drop NaNs before each test. Also set is_stationary_final to True if the final stage's p-value is below 0.05. The full recipe should reach stationarity.
Do a clean chronological evaluation. Hold out the last 12 months of air as the test set (train = the rest). Forecast those 12 months two ways and score each by MAE and RMSE on the held-out actuals:
- seasonal-naive: each test month = the value 12 months earlier (the last 12 of train)
- flat mean: every test month = the mean of train
Produce a dict evaluation with these exact key names:
- "snaive_mae", "snaive_rmse", "mean_mae", "mean_rmse" (all floats)
- "snaive_better": True if seasonal-naive has the lower MAE
Use only train to build both forecasts — never peek at the test set.
Check your understanding
In the pipeline, why fit the model on log(train) and exponentiate the forecasts rather than modeling the raw passengers directly?
Because logs make the numbers smaller and faster to compute
The log stabilizes the series' variance, which grows with the level (multiplicative seasonality), making it suitable for the additive, constant-variance assumptions of ARIMA
Because ARIMA cannot accept positive numbers
Because the log removes the trend
The seasonal model scored the lowest error on the held-out months. What single comparison makes that result meaningful rather than just a number?
That its AIC was the lowest
That it beat the seasonal-naive baseline on the same honest, held-out test
That it used more parameters than the alternatives
That its forecast line looked smooth
Which sequence correctly orders the core pipeline steps?
Fit model -> check stationarity -> split chronologically -> evaluate
Explore -> stabilize/difference to stationarity -> identify orders -> split chronologically -> fit on train -> forecast -> compare to baseline
Split randomly -> fit -> exponentiate -> done
Decompose -> forecast the trend component -> stop
Course wrap-up
You can now take a raw, messy, seasonal, non-stationary series and:
- get it onto a clean timeline, resample it, and fill its gaps;
- decompose it and recognize trend, seasonality, and noise;
- test and engineer stationarity with the ADF test and differencing;
- read ACF/PACF to propose ARIMA orders;
- fit a model and forecast with honest uncertainty;
- and — most importantly — validate chronologically, measure with MAE/RMSE/MAPE, and prove your model beats a baseline without leaking the future.
That last skill is the one that separates forecasts you can stake decisions on from numbers that merely look impressive. Keep the discipline, keep a baseline running, and keep your training set firmly in the past.