
Capstone: Building and Evaluating an End-to-End Statistical Forecasting Pipeline

One complete workflow that threads the whole course together — explore, stabilize variance, difference to stationarity, identify orders, split chronologically, fit a model, forecast with uncertainty, and prove skill against a baseline with honest out-of-sample metrics.

This is where it all comes together. Across the course you collected a toolkit — datetime indexing, resampling, gap-filling, decomposition, stationarity testing, differencing, ACF/PACF reading, ARIMA, and honest validation. Here we run the entire pipeline, start to finish, on the airline series, and end with the only question that matters: does our model actually beat a trivial baseline on data it has never seen?

Step 1 — Load and look

Every forecast starts with eyes on the data. We name what we see before touching a model.
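The page's original code cell is not preserved, so what follows is only a minimal sketch of a first look. It assumes a synthetic stand-in for the airline data: `make_airline_like` is a hypothetical helper that builds a monthly series with an upward drift and multiplicative seasonality, and its numbers are not the real passenger counts.

```python
import numpy as np
import pandas as pd

def make_airline_like(n_years=12, seed=0):
    """Hypothetical stand-in: monthly series with drift and multiplicative seasonality."""
    rng = np.random.default_rng(seed)
    n = 12 * n_years
    idx = pd.date_range("1949-01-01", periods=n, freq="MS")
    log_level = np.log(110) + np.cumsum(rng.normal(0.01, 0.02, n))  # random walk with drift
    log_season = 0.2 * np.sin(2 * np.pi * np.arange(n) / 12)        # yearly cycle
    log_noise = rng.normal(0, 0.03, n)
    return pd.Series(np.exp(log_level + log_season + log_noise), index=idx, name="passengers")

air = make_airline_like()

# Basic facts first: span, frequency, missing values.
print(air.index.min(), air.index.max(), len(air), air.isna().sum())

# Yearly mean and spread: a rising mean suggests trend, and a spread that
# rises with it suggests variance growing with the level.
yearly = air.groupby(air.index.year).agg(["mean", "std"])
print(yearly.round(1))
```

If the yearly spread climbs along with the yearly mean, that is the pattern a log transform is meant to stabilize.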


Step 2 — Confirm non-stationarity, then plan the transforms

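We don't guess; we test. On the raw series the ADF test should fail to reject its null hypothesis of a unit root (a large p-value), which is the evidence of non-stationarity. The fix-up sequence (log for variance, a regular difference for trend, a seasonal difference for the yearly cycle) should then drive the p-value below 0.05.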


Step 3 — Decompose to understand the structure

A multiplicative decomposition confirms what the eye saw and gives us a clean picture of each component — useful for sanity-checking and for explaining the forecast to stakeholders.


Step 4 — Identify orders from the stationary series

With the series made stationary, the ACF/PACF become readable. We're looking for the regular short-lag structure (for p, q) and any leftover spike at the seasonal lag 12 (which tells us we need seasonal terms).


Steps 5-7 — Split, fit, and forecast (honestly)

Now the part everything else was for. We hold out the last 24 months, fit on the past only, and forecast that held-out future. We fit two models — a seasonal ARIMA (the extra terms at lag 12 we just justified) and a plain ARIMA (no seasonal terms) — and we keep a seasonal-naive baseline. Then we score all three on the held-out months with honest out-of-sample metrics.
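The split, the baseline, and the metrics need nothing beyond pandas and NumPy. Here is a minimal sketch, again on a synthetic stand-in from a hypothetical helper (`make_airline_like`); the `score` function is also an illustrative name, not the course's.

```python
import numpy as np
import pandas as pd

def make_airline_like(n_years=12, seed=0):
    """Hypothetical stand-in: monthly series with drift and multiplicative seasonality."""
    rng = np.random.default_rng(seed)
    n = 12 * n_years
    idx = pd.date_range("1949-01-01", periods=n, freq="MS")
    log_level = np.log(110) + np.cumsum(rng.normal(0.01, 0.02, n))
    log_season = 0.2 * np.sin(2 * np.pi * np.arange(n) / 12)
    log_noise = rng.normal(0, 0.03, n)
    return pd.Series(np.exp(log_level + log_season + log_noise), index=idx, name="passengers")

air = make_airline_like()

# Chronological split: the last 24 months are held out, never shuffled.
horizon = 24
train, test = air.iloc[:-horizon], air.iloc[-horizon:]

# Seasonal-naive baseline: repeat the last observed year of TRAIN
# as many times as needed to cover the horizon.
last_year = train.iloc[-12:].to_numpy()
snaive = pd.Series(np.tile(last_year, horizon // 12), index=test.index)

def score(actual, forecast):
    """Out-of-sample error metrics on the held-out months."""
    err = actual.to_numpy() - np.asarray(forecast)
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAPE": float(np.mean(np.abs(err) / actual.to_numpy()) * 100),
    }

baseline_scores = score(test, snaive)
print(baseline_scores)
```

Any model fitted afterwards is scored with the same `score(test, ...)` call, so all candidates are judged on identical held-out months.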

What 'seasonal ARIMA' adds

A seasonal ARIMA is just ARIMA with the same AR/I/MA machinery repeated at the seasonal lag. order=(1,1,1) handles the month-to-month dynamics; seasonal_order=(1,1,1,12) handles the year-to-year dynamics — including the seasonal difference we found we needed. It's the natural home for the lag-12 structure plain ARIMA left in its residuals.


The seasonal model posts the lowest error on data it never saw — and it beats the baseline, which is the bar every model must clear. Plain ARIMA captures the trend but, lacking seasonal terms, leaves the yearly cycle in its errors, so it trails the seasonal model. The gap between them is precisely the seasonality that plain ARIMA's lag-12 residual spike warned about on the ARIMA page — the residual diagnostic, paid off in hard numbers.

Step 8 — Visualize the forecast against reality

[Code block not preserved: plots the forecast against the held-out actuals.]

What 'done' looks like

A finished forecast is not just a line into the future. It is: a model whose assumptions you checked (stationarity), whose orders you justified (ACF/PACF), evaluated on data it never saw (chronological split), that beats a baseline on honest metrics (MAE/RMSE/MAPE), with uncertainty bands that contain the held-out actuals about as often as their nominal level promises (roughly 95% of the time for a 95% band). Anything less is a guess wearing a lab coat.
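The uncertainty-band check can be made concrete by computing empirical coverage: the share of held-out actuals that land inside the band. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
import numpy as np

def interval_coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    actual, lower, upper = (np.asarray(a, dtype=float) for a in (actual, lower, upper))
    return float(np.mean((actual >= lower) & (actual <= upper)))

# Made-up held-out values and band edges, for illustration only.
actual = [100, 110, 120, 130]
lower = [90, 100, 125, 115]
upper = [110, 120, 140, 145]

coverage = interval_coverage(actual, lower, upper)
print(coverage)  # 0.75: the third value (120) sits below its band [125, 140]
```

On a short test set the observed coverage is noisy, so it is better read as a rough check against the nominal level than as a pass/fail rule.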

Step 9 — Real life: it doesn't end at the notebook

In production, a forecast is a living thing:

  • Retrain on a schedule. As new months arrive, refit so the model learns the latest dynamics. A model trained once and frozen slowly goes stale.
  • Monitor the error. Track the live forecast error over time; if it drifts up, the world has changed (a regime shift) and the model needs attention.
  • Keep the baseline running. If the fancy model ever stops beating seasonal-naive in production, fall back to the baseline — it's cheaper and, in that moment, better.
  • Respect the horizon. Short-horizon forecasts are the most trustworthy; uncertainty compounds with every step ahead, which is why far-out forecasts come with wide bands. Don't promise precision the uncertainty doesn't support.
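The monitoring and fallback bullets above can be sketched as a rolling comparison of live errors. Everything here is hypothetical: the function name, the 3-month window, and the numbers are chosen only to show the mechanism.

```python
import pandas as pd

def monitor(actual, model_fc, baseline_fc, window=3):
    """Rolling MAE for the model and the baseline, plus a fallback flag."""
    model_mae = (actual - model_fc).abs().rolling(window).mean()
    baseline_mae = (actual - baseline_fc).abs().rolling(window).mean()
    return pd.DataFrame({
        "model_mae": model_mae,
        "baseline_mae": baseline_mae,
        # True once the model's recent error exceeds the baseline's.
        # Early rows without a full window compare NaN and stay False.
        "use_baseline": model_mae > baseline_mae,
    })

# Made-up live data: the level jumps in month 4 and the model fails to follow.
months = pd.date_range("2024-01-01", periods=6, freq="MS")
actual = pd.Series([100, 102, 104, 120, 125, 130], index=months, dtype=float)
model_fc = pd.Series([101, 101, 105, 106, 108, 110], index=months, dtype=float)
baseline_fc = pd.Series([98, 100, 101, 118, 122, 128], index=months, dtype=float)

report = monitor(actual, model_fc, baseline_fc)
print(report.round(2))
```

In this toy run the flag flips to True from the fourth month onward, which is the signal to fall back to the baseline and schedule a refit.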

Practice

Challenge
Run the stationarity pipeline end to end

Reproduce the diagnosis stage on the loaded air series. Build a dict report recording the ADF p-value at each stage of the standard airline recipe:

  • "raw" — ADF p-value of air
  • "log" — ADF p-value of np.log(air)
  • "log_diff" — ADF p-value of np.log(air).diff()
  • "log_diff_sdiff" — ADF p-value of np.log(air).diff().diff(12)

Drop NaNs before each test. Also set is_stationary_final to True if the final stage's p-value is below 0.05. The full recipe should reach stationarity.

Challenge
Honest evaluation: does the seasonal baseline beat a flat forecast?

Do a clean chronological evaluation. Hold out the last 12 months of air as the test set (train = the rest). Forecast those 12 months two ways and score each by MAE and RMSE on the held-out actuals:

  • seasonal-naive: each test month = the value 12 months earlier (the last 12 of train)
  • flat mean: every test month = the mean of train

Produce a dict evaluation with these exact key names:

  • "snaive_mae", "snaive_rmse", "mean_mae", "mean_rmse" (all floats)
  • "snaive_better": True if seasonal-naive has the lower MAE

Use only train to build both forecasts — never peek at the test set.

Check your understanding

Question (select one)

In the pipeline, why fit the model on log(train) and exponentiate the forecasts rather than modeling the raw passengers directly?

Because logs make the numbers smaller and faster to compute

The log stabilizes the series' variance, which grows with the level (multiplicative seasonality), making it suitable for the additive, constant-variance assumptions of ARIMA

Because ARIMA cannot accept positive numbers

Because the log removes the trend

Question (select one)

The seasonal model scored the lowest error on the held-out months. What single comparison makes that result meaningful rather than just a number?

That its AIC was the lowest

That it beat the seasonal-naive baseline on the same honest, held-out test

That it used more parameters than the alternatives

That its forecast line looked smooth

Question (select one)

Which sequence correctly orders the core pipeline steps?

Fit model -> check stationarity -> split chronologically -> evaluate

Explore -> stabilize/difference to stationarity -> identify orders -> split chronologically -> fit on train -> forecast -> compare to baseline

Split randomly -> fit -> exponentiate -> done

Decompose -> forecast the trend component -> stop

Course wrap-up

You can now take a raw, messy, seasonal, non-stationary series and:

  • get it onto a clean timeline, resample it, and fill its gaps;
  • decompose it and recognize trend, seasonality, and noise;
  • test and engineer stationarity with the ADF test and differencing;
  • read ACF/PACF to propose ARIMA orders;
  • fit a model and forecast with honest uncertainty;
  • and — most importantly — validate chronologically, measure with MAE/RMSE/MAPE, and prove your model beats a baseline without leaking the future.

That last skill is the one that separates forecasts you can stake decisions on from numbers that merely look impressive. Keep the discipline, keep a baseline running, and keep your training set firmly in the past.
