The Cardinal Sin: Preventing Data Leakage with Chronological Validation Splits
Why a random train/test split is catastrophic for time series, how to evaluate forecasts honestly with chronological splits and walk-forward backtesting, the MAE/RMSE/MAPE metrics coded by hand, and the many sneaky ways the future leaks into the past.
This is the most important page in the course. You can fit a flawless ARIMA and still ship a disaster — if you evaluate it wrong. The discipline that separates a real forecaster from someone with a good-looking notebook is honest validation: measuring a model's true ability to project into a future it has never seen. Get this wrong and every accuracy number you report is a comfortable fiction.
The cardinal rule, one more time
You may only use the past to predict the future, never the reverse. Every validation technique on this page is a mechanism for enforcing that rule. Every leakage bug is a way it gets broken. If you remember one thing from this entire course, make it this.
The cardinal sin: a random split
The default split in nearly every ML tutorial — shuffle the rows, take 80% for training and 20% for testing — is catastrophic for time series. When you randomly assign rows, a test point from March ends up surrounded in the training set by February and April. The model gets to peek at both sides of every test point. It will score brilliantly in evaluation and collapse in production, where the future genuinely doesn't exist yet.
Let's measure the leakage. We'll take a smooth-ish series and "predict" held-out points two ways: a random hold-out (where each test point still has neighbours on both sides inside the training set) versus a chronological hold-out (where the test points are at the end, with no future neighbour to lean on).
The random hold-out reports a tiny error, and it's a lie. It got there only by interpolating between neighbouring training points, some of which lie in the test point's future. In production those future points don't exist, so the method can't work, and the honest chronological number is many times larger. The random split didn't measure forecasting; it measured interpolation.
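A minimal sketch of that comparison, under stated assumptions: the series is synthetic (trend plus a sine wave plus light noise), and the "model" is plain linear interpolation via np.interp, which holds the last known value flat once it runs out of training points. The names mae_random and mae_chrono are illustrative, not from the course code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test = 200, 40
t = np.arange(n)
# Assumed synthetic series: linear trend + 25-step cycle + small noise
y = 0.1 * t + 2.0 * np.sin(2 * np.pi * t / 25) + rng.normal(0.0, 0.1, n)

# Random hold-out: scattered interior points, so every test point has
# training neighbours on BOTH sides (past and future).
test_rand = np.sort(rng.choice(np.arange(1, n - 1), size=n_test, replace=False))
train_rand = np.setdiff1d(t, test_rand)
pred_rand = np.interp(test_rand, train_rand, y[train_rand])
mae_random = float(np.mean(np.abs(y[test_rand] - pred_rand)))

# Chronological hold-out: the last n_test points. No future neighbour exists,
# so np.interp can only carry the last training value forward.
train_chron, test_chron = t[: n - n_test], t[n - n_test:]
pred_chron = np.interp(test_chron, train_chron, y[train_chron])
mae_chrono = float(np.mean(np.abs(y[test_chron] - pred_chron)))

print(f"random hold-out MAE:        {mae_random:.3f}")
print(f"chronological hold-out MAE: {mae_chrono:.3f}")
```

The same method and the same series produce two very different scores; the only thing that changed is whether the test points had future neighbours in the training set.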
Why is k-fold cross-validation with random folds — the standard for ordinary tabular data — the wrong choice for time series?
It's too slow for large time series
Random folds put future observations in the training set for some folds, leaking future information into predictions about the past and inflating the score
It can only be used for classification
It requires shuffling, which pandas can't do
Measuring error honestly: MAE, RMSE, MAPE
Once you have an honest chronological forecast, you need a number for "how close." Three metrics dominate, and each makes a different trade-off. Code them by hand once and you'll never misread them again.
| Metric | What it measures | Reach for it when | Watch out for |
|---|---|---|---|
| MAE | average absolute miss, in data units | you want a robust, interpretable number; every unit counts equally | doesn't single out big misses |
| RMSE | square root of the average squared miss, in data units | large errors are disproportionately costly (capacity, safety) | sensitive to outliers |
| MAPE | average miss as a percent | comparing across series of different scales | explodes near zero, undefined at zero, asymmetric (over- vs under-forecast) |
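For reference, the three metrics in the table can be written as follows. The notation is an assumption of this sketch: y_t is the actual value, ŷ_t the forecast, and n the number of test points.

```latex
\begin{aligned}
\mathrm{MAE}  &= \frac{1}{n}\sum_{t=1}^{n} \lvert y_t - \hat{y}_t \rvert \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{t=1}^{n} \left(y_t - \hat{y}_t\right)^2} \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{t=1}^{n} \left\lvert \frac{y_t - \hat{y}_t}{y_t} \right\rvert
\end{aligned}
```

The division by y_t in MAPE is where the trouble near zero comes from.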
MAPE has sharp edges
MAPE divides by the actual value, so it blows up when the truth is near zero and is undefined at exactly zero. It's also asymmetric: with a non-negative forecast of a positive actual, an under-forecast can never score worse than 100%, while an over-forecast has no ceiling, so MAPE quietly favours models that forecast low. It's wonderful for comparing a model across products of wildly different sales volumes, but a poor choice for series that pass through zero (temperatures in Celsius, net flows). When in doubt, report MAE and RMSE, and add MAPE only when a percentage genuinely makes sense.
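One possible hand-coded version of the three metrics, followed by a few probes of MAPE's sharp edges. The sample arrays are made-up illustration values, and this mape deliberately has no guard for zeros in y_true, which is an assumption of the sketch rather than a recommendation.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the data."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: squaring weights big misses more heavily."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a percent. Undefined if y_true has zeros."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Made-up example: misses of 10, 20 and 0 units
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
scores = {"mae": mae(y_true, y_pred),
          "rmse": rmse(y_true, y_pred),
          "mape": mape(y_true, y_pred)}
print(scores)

# Sharp edge 1: the same 5-unit miss, far from zero vs near zero
far_from_zero = mape([100.0], [105.0])
near_zero = mape([0.5], [5.5])

# Sharp edge 2: asymmetry. Forecasting 0 costs 100%; forecasting 3x costs 200%.
under = mape([100.0], [0.0])
over = mape([100.0], [300.0])
print(far_from_zero, near_zero, under, over)
```

RMSE comes out above MAE on the sample arrays because the single 20-unit miss is weighted more heavily once squared.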
Beyond one split: walk-forward backtesting
A single train/test split gives one estimate of forecast skill — and one estimate is noisy. Maybe that particular tail happened to be easy or hard. Backtesting (time-series cross-validation) re-runs the forecast across many successive cut-off points and averages the skill. The iron rule holds in every fold: train is always strictly before test.
The picture above is the expanding-window (anchored) scheme: the training set grows each fold while the test window marches forward. A rolling-window scheme instead keeps the training window a fixed size and slides it — useful when old history stops being relevant. Either way, you get several honest out-of-sample scores instead of one.
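The two schemes differ only in where the training window starts, which a small index generator can make concrete. This is a sketch: walk_forward_splits and its max_train parameter are hypothetical names chosen for illustration, not part of any library.

```python
def walk_forward_splits(n, initial, horizon, step, max_train=None):
    """Return (train_range, test_range) pairs over positions 0..n-1.

    max_train=None -> expanding window (training always starts at 0).
    max_train=k    -> rolling window (training keeps only the last k points).
    """
    folds = []
    test_start = initial
    # Stop as soon as a full horizon-length test window no longer fits
    while test_start + horizon <= n:
        train_start = 0 if max_train is None else max(0, test_start - max_train)
        folds.append((range(train_start, test_start),
                      range(test_start, test_start + horizon)))
        test_start += step
    return folds

expanding = walk_forward_splits(n=20, initial=8, horizon=4, step=4)
rolling = walk_forward_splits(n=20, initial=8, horizon=4, step=4, max_train=8)

for name, folds in [("expanding", expanding), ("rolling", rolling)]:
    for train, test in folds:
        print(f"{name:9s} train {train.start:2d}..{train.stop - 1:2d}"
              f"  test {test.start:2d}..{test.stop - 1:2d}")
```

In both schemes every training range ends exactly where its test range begins, so no fold can see its own future.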
Each fold trains only on its past and is scored on its future; averaging the folds gives a far steadier estimate of real skill than any single split. And notice we backtested a baseline (seasonal-naive) — because the real question is never "what's my error?" but "do I beat the baseline on the same honest test?"
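A backtest of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the monthly series is synthetic, the fold sizes are arbitrary, and a flat training-mean forecast stands in as a second, weaker reference so the seasonal-naive score has something to be compared against.

```python
import numpy as np

rng = np.random.default_rng(42)
n, season = 120, 12
t = np.arange(n)
# Assumed synthetic monthly series: level + trend + yearly cycle + noise
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / season) + rng.normal(0, 1, n)

def seasonal_naive(train, horizon, season):
    """Repeat the last full seasonal cycle of the training data."""
    reps = int(np.ceil(horizon / season))
    return np.tile(train[-season:], reps)[:horizon]

def flat_mean(train, horizon):
    """Forecast the training mean for every future step."""
    return np.full(horizon, train.mean())

initial, horizon, step = 60, 12, 12
fold_mae = {"snaive": [], "mean": []}

# Expanding window: each cut-off trains on y[:cut] and tests on the next horizon
for cut in range(initial, n - horizon + 1, step):
    train, test = y[:cut], y[cut:cut + horizon]
    fold_mae["snaive"].append(np.mean(np.abs(test - seasonal_naive(train, horizon, season))))
    fold_mae["mean"].append(np.mean(np.abs(test - flat_mean(train, horizon))))

# Average the per-fold scores into one backtest number per method
backtest = {name: float(np.mean(v)) for name, v in fold_mae.items()}
print({name: round(v, 2) for name, v in backtest.items()})
```

Any candidate model would be dropped into the same loop and judged against the seasonal-naive row, on the same folds.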
What is the defining property that every fold of a time-series backtest must satisfy?
Each fold must contain the same number of points
In every fold, all training timestamps come strictly before all test timestamps — the model only ever learns from the past
The folds must be chosen randomly for fairness
Each test point must appear in the training set of another fold
The other ways the future leaks in
A chronological split is necessary but not sufficient. Leakage sneaks in through preprocessing too. Each of these silently hands the model information it won't have at prediction time:
- Scaling/normalizing with global statistics. Computing a mean/std (or min/max) over the entire series, including the test portion, leaks the future's distribution into training. Fit the scaler on train only.
- Future-aware feature engineering. Centered rolling windows, shift(-k) leads, bfill, and linear interpolation all read future values. Safe for retrospective charts, leakage as model inputs.
- Imputation or decomposition on the full series. Filling gaps or decomposing using all the data lets test-period values influence training.
- Tuning on the test set. If you pick (p, d, q) by trying many options and keeping the one with the best test score, the test set has trained your choices. You need a separate validation set (or nested CV); the final test set is looked at once.
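One way to check a feature for future-awareness is to corrupt only the future and see whether the feature's past values move. The sketch below applies that test to four pandas features on a toy series; the features helper and the column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float))

# Same series, but with everything from position 6 onward overwritten
s_future_changed = s.copy()
s_future_changed.iloc[6:] = 999.0

def features(x):
    return pd.DataFrame({
        "trailing_mean": x.rolling(3).mean(),               # uses t-2, t-1, t
        "lag_1": x.shift(1),                                # uses t-1
        "centered_mean": x.rolling(3, center=True).mean(),  # uses t-1, t, t+1
        "lead_1": x.shift(-1),                              # uses t+1
    })

# Compare the features on rows 0..5 only, i.e. strictly before the change
past = slice(0, 6)
a = features(s).iloc[past]
b = features(s_future_changed).iloc[past]

# A feature is leaky if its past values changed when only the future changed
leaky = [c for c in a.columns if not a[c].equals(b[c])]
print(leaky)
```

The trailing mean and the lag are untouched by the corruption, while the centered window and the lead both shift, which marks them as unsafe model inputs.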
Look at preprocessing, not just the split
The split is the famous leak, but the subtle ones live in preprocessing. Before trusting any backtest, ask of every transformation: "could this step have seen data from the test period?" Scalers, imputers, decompositions, feature lags, and hyperparameter choices must all be derived from the training data of each fold — never from the whole series.
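For a scaler, that question comes down to which rows the mean and standard deviation were computed from. A minimal sketch with a made-up trending series and hand-rolled standardization (no scaler library assumed) shows how much the two choices can differ:

```python
import numpy as np

t = np.arange(100, dtype=float)
y = 50.0 + 2.0 * t            # assumed toy series with a steady upward trend

cut = 80                       # chronological split: first 80 train, last 20 test
train, test = y[:cut], y[cut:]

# Leaky: statistics computed over the whole series, test period included
mu_all, sd_all = y.mean(), y.std()

# Honest: statistics computed from the training rows only
mu_tr, sd_tr = train.mean(), train.std()

# Apply the train-only statistics to BOTH parts
train_scaled = (train - mu_tr) / sd_tr
test_scaled = (test - mu_tr) / sd_tr

print(f"global mean {mu_all:.1f} vs train-only mean {mu_tr:.1f}")
print(f"test range after honest scaling: "
      f"{test_scaled.min():.2f} .. {test_scaled.max():.2f}")
```

The global mean sits higher than the train-only mean because it has already absorbed the test period's level. In a backtest the same fit-on-train step would be repeated inside every fold.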
Practice
Code the forecasting metrics from scratch. Given true and predicted arrays, implement:
- mae(y_true, y_pred): mean absolute error
- rmse(y_true, y_pred): root mean squared error
- mape(y_true, y_pred): mean absolute percentage error (as a percent, i.e. multiplied by 100)
Then evaluate them on the provided y_true and y_pred and store the results (as plain floats) in a dict scores with keys "mae", "rmse", "mape".
Write expanding_folds(n, initial, horizon, step) that yields (train_idx, test_idx) pairs for an expanding-window backtest over n points:
- The first fold trains on indices 0..initial-1 and tests on initial..initial+horizon-1.
- Each later fold's training set grows to include everything before its test window; the test window advances by step.
- Stop when a full horizon-length test window no longer fits.
Return the list of folds as folds (a list of (train_idx, test_idx) tuples, where each is a list/range of integer positions). Every fold's training indices must all be less than all of its test indices (leakage-free).
Evaluate a forecaster against the seasonal-naive baseline on a single chronological split. The airline air series is loaded; hold out the last 24 months as the test set (train = the rest).
Compute the test-set MAE for two forecasts of those 24 months:
- mae_snaive: the seasonal-naive forecast, where each test month equals the value from the same month one year earlier. For this 24-month horizon that is the last 12 training months repeated twice (np.tile(train.values[-12:], 2)).
- mae_mean: a trivial flat forecast equal to the training mean, repeated 24 times.
Set baseline_wins to True if the seasonal-naive MAE is lower than the flat-mean MAE (it should be — seasonality matters here).
Check your understanding
A model scores 2% error on a random 80/20 split but 18% error in production. What's the most likely explanation?
The production data is fundamentally different
The random split leaked future information into training, so the 2% was optimistic; the honest, leakage-free error is closer to the 18% seen in production
The model needs more training epochs
2% and 18% are both correct and unrelated
You standardize your features using the mean and standard deviation computed over the entire dataset, then do a chronological train/test split. Is this safe?
Yes — a chronological split is all you need
No — the scaler's statistics include the test period, leaking future information; fit the scaler on the training data only
Yes, as long as you don't shuffle
It only matters for classification problems
Why report a forecast's error from a multi-fold backtest rather than a single train/test split?
A single split is mathematically invalid
One split gives a single, noisy estimate that may have landed on an unusually easy or hard period; averaging several folds yields a more stable, trustworthy measure of skill
Backtesting lets you train on the test set
It makes the model itself more accurate
Which metric would be the most problematic choice for evaluating forecasts of a series whose values regularly pass through and near zero (e.g., net hourly energy flow that can be positive, negative, or zero)?
MAE
RMSE
MAPE
They are all equally appropriate here
You try 30 different (p, d, q) combinations and keep the one with the lowest error on your test set. What's the hidden problem?
Nothing — picking the best test score is the goal
The test set has now influenced model selection, so its score is optimistic; you need a separate validation set (or nested CV), and the final test set should be evaluated only once
30 combinations is too few to be reliable
ARIMA orders should never be compared
Key takeaways
- A random / k-fold split leaks the future into training and reports fantasy accuracy. Time series demands chronological splits: train is the past, test is the future.
- Backtest with walk-forward folds (expanding or rolling window) for a stable skill estimate — and in every fold, train strictly precedes test.
- Code metrics by hand: MAE (robust, in-units), RMSE (punishes big misses), MAPE (scale-free percent, but breaks near zero).
- Leakage hides in preprocessing too: fit scalers/imputers/decompositions on train only, avoid future-aware features (centered windows, leads, bfill/interpolate as inputs), and never tune on the test set.
- Always compare to a baseline on the same honest backtest — skill is relative.
You now have every piece: shaping a timeline, reshaping its resolution, handling gaps, decomposing structure, achieving stationarity, reading ACF/PACF, fitting ARIMA, and — above all — validating honestly. The capstone threads them into one end-to-end pipeline.