The Cardinal Sin: Preventing Data Leakage with Chronological Validation Splits

Why a random train/test split is catastrophic for time series, how to evaluate forecasts honestly with chronological splits and walk-forward backtesting, the MAE/RMSE/MAPE metrics coded by hand, and the many sneaky ways the future leaks into the past.

This is the most important page in the course. You can fit a flawless ARIMA and still ship a disaster — if you evaluate it wrong. The discipline that separates a real forecaster from someone with a good-looking notebook is honest validation: measuring a model's true ability to project into a future it has never seen. Get this wrong and every accuracy number you report is a comfortable fiction.

The cardinal rule, one more time

You may only use the past to predict the future, never the reverse. Every validation technique on this page is a mechanism for enforcing that rule. Every leakage bug is a way it gets broken. If you remember one thing from this entire course, make it this.

The cardinal sin: a random split

The default split in nearly every ML tutorial — shuffle the rows, take 80% for training and 20% for testing — is catastrophic for time series. When you randomly assign rows, a test point from March ends up surrounded in the training set by February and April. The model gets to peek at both sides of every test point. It will score brilliantly in evaluation and collapse in production, where the future genuinely doesn't exist yet.

Let's measure the leakage. We'll take a smooth-ish series and "predict" held-out points two ways: a random hold-out (where each test point still has neighbours on both sides inside the training set) versus a chronological hold-out (where the test points are at the end, with no future neighbour to lean on).
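Here is a minimal sketch of that experiment. The series is synthetic (trend plus a sine wave plus noise), and the "model" is plain linear interpolation via np.interp; both are assumptions chosen for illustration, not a fixed recipe. Outside the range of its training points, np.interp simply holds the last known value, which is exactly what the chronological case is reduced to.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test = 200, 40
t = np.arange(n)
# Illustrative smooth-ish series: slow trend + sine wave + small noise
y = 0.05 * t + 5 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.2, n)

# Random hold-out: test points scattered through the series, so every
# test point has training neighbours on BOTH sides.
test_rand = np.sort(rng.choice(np.arange(1, n - 1), size=n_test, replace=False))
train_rand = np.setdiff1d(t, test_rand)
pred_rand = np.interp(test_rand, train_rand, y[train_rand])
mae_random = np.mean(np.abs(y[test_rand] - pred_rand))

# Chronological hold-out: the last n_test points. There is no right-hand
# neighbour, so np.interp can only carry the last training value forward.
train_chrono, test_chrono = t[: n - n_test], t[n - n_test:]
pred_chrono = np.interp(test_chrono, train_chrono, y[train_chrono])
mae_chrono = np.mean(np.abs(y[test_chrono] - pred_chrono))

print(f"random hold-out MAE:        {mae_random:.2f}")
print(f"chronological hold-out MAE: {mae_chrono:.2f}")
```

The same method is used in both cases; only the position of the held-out points changes.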

The random hold-out reports a tiny error, and it's a lie. It only achieved that by interpolating between training neighbours on both sides of each test point, including ones that lie in the test point's future. In production those future points don't exist, so the method can't work, and the honest chronological number is many times larger. The random split didn't measure forecasting; it measured interpolation.

Question (select one)

Why is k-fold cross-validation with random folds — the standard for ordinary tabular data — the wrong choice for time series?

It's too slow for large time series

Random folds put future observations in the training set for some folds, leaking future information into predictions about the past and inflating the score

It can only be used for classification

It requires shuffling, which pandas can't do

Measuring error honestly: MAE, RMSE, MAPE

Once you have an honest chronological forecast, you need a number for "how close." Three metrics dominate, and each makes a different trade-off. Code them by hand once and you'll never misread them again.

Metric	What it measures	Reach for it when	Watch out for
MAE	average absolute miss, in data units	you want a robust, interpretable number; every unit counts equally	doesn't single out big misses
RMSE	like MAE but squares errors first	large errors are disproportionately costly (capacity, safety)	sensitive to outliers
MAPE	average miss as a percent	comparing across series of different scales	explodes near zero, undefined at zero, asymmetric (over- vs under-forecast)
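As a hedged sketch, the three metrics in the table can be written in a few lines of NumPy. The sample arrays are made up for illustration, and mape here assumes no actual value is zero.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the miss, in data units."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: squaring first makes big misses count more."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes no zero in y_true."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Illustrative values: misses of 10, 10 and 40 units
y_true = [100, 200, 400]
y_pred = [110, 190, 360]

scores = {"mae": mae(y_true, y_pred),
          "rmse": rmse(y_true, y_pred),
          "mape": mape(y_true, y_pred)}
print(scores)  # mae = 20.0, rmse ≈ 24.49, mape ≈ 8.33
```

Notice that the single 40-unit miss pulls RMSE above MAE, while MAPE treats the 10-unit miss on 100 and the 40-unit miss on 400 as the same 10%.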

MAPE has sharp edges

MAPE divides by the actual value, so it blows up when the truth is near zero and is undefined at exactly zero. It's also asymmetric: for positive actuals and a non-negative forecast, the error on an under-forecast can never exceed 100%, while the error on an over-forecast has no upper limit, so MAPE tends to favour models that forecast low. It's wonderful for comparing a model across products of wildly different sales volumes, but a poor choice for series that pass through zero (temperatures in Celsius, net flows). When in doubt, report MAE and RMSE, and add MAPE only when a percentage genuinely makes sense.
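A few made-up numbers make those sharp edges concrete. The helper ape (absolute percentage error for a single point) is a hypothetical name used only for this illustration.

```python
def ape(actual, forecast):
    """Absolute percentage error for one point, in percent."""
    return abs(actual - forecast) / abs(actual) * 100

# The same 50-unit miss scores differently depending on its direction
over = ape(100, 150)    # forecast too high: 50.0
under = ape(150, 100)   # forecast too low: about 33.3

# A modest miss on a near-zero actual produces an enormous percentage
near_zero = ape(0.01, 1.0)   # roughly 9900

# At exactly zero the metric is undefined
try:
    ape(0.0, 1.0)
    zero_fails = False
except ZeroDivisionError:
    zero_fails = True

print(over, round(under, 1), round(near_zero), zero_fails)
```

MAE would report the first two cases as the same 50-unit miss, which is one reason to report it alongside any percentage metric.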

Beyond one split: walk-forward backtesting

A single train/test split gives one estimate of forecast skill — and one estimate is noisy. Maybe that particular tail happened to be easy or hard. Backtesting (time-series cross-validation) re-runs the forecast across many successive cut-off points and averages the skill. The iron rule holds in every fold: train is always strictly before test.

One layout is the expanding-window (anchored) scheme: the training set grows each fold while the test window marches forward. A rolling-window scheme instead keeps the training window a fixed size and slides it, which is useful when old history stops being relevant. Either way, you get several honest out-of-sample scores instead of one.
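Both schemes can be sketched with one small generator. The function name walk_forward_folds, its window argument, and the synthetic 12-step seasonal series below are all assumptions made for illustration; the backtest compares a seasonal-naive forecast with a flat training-mean forecast on identical folds.

```python
import numpy as np

def walk_forward_folds(n, initial, horizon, step, window=None):
    """Yield (train_idx, test_idx) arrays of integer positions.

    window=None gives an expanding (anchored) training set;
    an integer window gives a rolling training set of that length.
    """
    test_start = initial
    while test_start + horizon <= n:          # stop when a full test window no longer fits
        train_start = 0 if window is None else max(0, test_start - window)
        yield np.arange(train_start, test_start), np.arange(test_start, test_start + horizon)
        test_start += step

def seasonal_naive(train, horizon, season=12):
    """Repeat the last full season of the training data across the horizon."""
    reps = int(np.ceil(horizon / season))
    return np.tile(train[-season:], reps)[:horizon]

# Illustrative series: trend + 12-step seasonality + noise
rng = np.random.default_rng(1)
t = np.arange(120)
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

folds = list(walk_forward_folds(len(y), initial=60, horizon=12, step=12))
mae_snaive, mae_mean = [], []
for train_idx, test_idx in folds:
    train, test = y[train_idx], y[test_idx]   # train is strictly before test
    mae_snaive.append(np.mean(np.abs(test - seasonal_naive(train, len(test)))))
    mae_mean.append(np.mean(np.abs(test - train.mean())))

print(f"folds: {len(folds)}")
print(f"seasonal-naive MAE (mean over folds): {np.mean(mae_snaive):.2f}")
print(f"flat-mean MAE (mean over folds):      {np.mean(mae_mean):.2f}")

# Rolling variant: same test windows, but only the latest 36 points are used for training
rolling = list(walk_forward_folds(len(y), initial=60, horizon=12, step=12, window=36))
```

The per-fold lists are worth inspecting as well as the averages: a model that wins on average but loses badly in one fold deserves a closer look.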

Each fold trains only on its past and is scored on its future; averaging the folds gives a far steadier estimate of real skill than any single split. And notice we backtested a baseline (seasonal-naive) — because the real question is never "what's my error?" but "do I beat the baseline on the same honest test?"

Question (select one)

What is the defining property that every fold of a time-series backtest must satisfy?

Each fold must contain the same number of points

In every fold, all training timestamps come strictly before all test timestamps — the model only ever learns from the past

The folds must be chosen randomly for fairness

Each test point must appear in the training set of another fold

The other ways the future leaks in

A chronological split is necessary but not sufficient. Leakage sneaks in through preprocessing too. Each of these silently hands the model information it won't have at prediction time:

Scaling/normalizing with global statistics. Computing a mean/std (or min/max) over the entire series, including the test portion, leaks the future's distribution into training. Fit the scaler on train only.
Future-aware feature engineering. Centered rolling windows, shift(-k) leads, bfill, and linear interpolation all read future values. They are safe for retrospective charts but become leakage when used as model inputs.
Imputation or decomposition on the full series. Filling gaps or decomposing using all the data lets test-period values influence training.
Tuning on the test set. If you pick (p, d, q) by trying many options and keeping the one with the best test score, the test set has trained your choices. You need a separate validation set (or nested CV); the final test set is looked at once.
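One way to check whether a feature is future-aware is to change a future value and see whether earlier features move. The sketch below does this with pandas on a toy series (the values are illustrative); it compares a centered rolling mean with a trailing one that is shifted so it only sees strictly earlier points.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype=float))

def centered_mean(x):
    # Window covers t-1, t, t+1: reads one step into the future
    return x.rolling(3, center=True).mean()

def trailing_mean(x):
    # shift(1) first, so the window covers t-3, t-2, t-1 only
    return x.shift(1).rolling(3).mean()

# Change only the final observation, i.e. the "future" relative to position 8
s_changed = s.copy()
s_changed.iloc[-1] = 1000.0

# Does the feature at position 8 react to a change at position 9?
leaks = centered_mean(s).iloc[8] != centered_mean(s_changed).iloc[8]

# Are all trailing features up to position 8 unaffected?
safe = trailing_mean(s).iloc[:9].equals(trailing_mean(s_changed).iloc[:9])

print(f"centered window reacts to the future: {leaks}")
print(f"trailing window unaffected:           {safe}")
```

The same perturbation test can be applied to any feature pipeline before it is trusted inside a backtest.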

Look at preprocessing, not just the split

The split is the famous leak, but the subtle ones live in preprocessing. Before trusting any backtest, ask of every transformation: "could this step have seen data from the test period?" Scalers, imputers, decompositions, feature lags, and hyperparameter choices must all be derived from the training data of each fold — never from the whole series.
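For scaling, the rule can be sketched as follows on a single split; inside a backtest the same fit-on-train step would be repeated in every fold. The trending series and the 80/20 cut are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.linspace(0, 100, 100) + rng.normal(0, 1, 100)   # upward-trending series

split = 80
train, test = y[:split], y[split:]

# Leaky: statistics computed over the whole series include the test period
mu_all, sd_all = y.mean(), y.std()

# Leakage-free: statistics come from the training portion only...
mu_tr, sd_tr = train.mean(), train.std()

# ...and are then applied unchanged to both portions
train_scaled = (train - mu_tr) / sd_tr
test_scaled = (test - mu_tr) / sd_tr

print(f"mean over all data: {mu_all:.1f}   mean over train only: {mu_tr:.1f}")
print(f"scaled test mean:   {test_scaled.mean():.2f}")
```

With a trend, the full-series mean sits above the training mean, which is the future's level sneaking into the scaler. The scaled test values landing well away from zero is not a bug: it is what genuinely unseen data looks like to a scaler fitted on the past.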

Practice

Code the forecasting metrics from scratch. Given true and predicted arrays, implement:

mae(y_true, y_pred) — mean absolute error
rmse(y_true, y_pred) — root mean squared error
mape(y_true, y_pred) — mean absolute percentage error (as a percent, i.e. multiplied by 100)

Then evaluate them on the provided y_true and y_pred and store the results (as plain floats) in a dict scores with keys "mae", "rmse", "mape".

Write expanding_folds(n, initial, horizon, step) that yields (train_idx, test_idx) pairs for an expanding-window backtest over n points:

The first fold trains on indices 0..initial-1 and tests on initial..initial+horizon-1.
Each later fold's training set grows to include everything before its test window; the test window advances by step.
Stop when a full horizon-length test window no longer fits.

Return the list of folds as folds (a list of (train_idx, test_idx) tuples, where each is a list/range of integer positions). Every fold's training indices must all be less than all of its test indices (leakage-free).

Evaluate a forecaster against the seasonal-naive baseline on a single chronological split. The airline air series is loaded; hold out the last 24 months as the test set (train = the rest).

Compute the test-set MAE for two forecasts of those 24 months:

mae_snaive — the seasonal-naive forecast: each test month equals the value from the same month one year earlier. For this 24-month horizon that is the last 12 training months repeated twice (np.tile(train.values[-12:], 2)).
mae_mean — a trivial flat forecast equal to the training mean, repeated 24 times.

Set baseline_wins to True if the seasonal-naive MAE is lower than the flat-mean MAE (it should be — seasonality matters here).

Check your understanding

Question (select one)

A model scores 2% error on a random 80/20 split but 18% error in production. What's the most likely explanation?

The production data is fundamentally different

The random split leaked future information into training, so the 2% was optimistic; the honest, leakage-free error is closer to the 18% seen in production

The model needs more training epochs

2% and 18% are both correct and unrelated

Question (select one)

You standardize your features using the mean and standard deviation computed over the entire dataset, then do a chronological train/test split. Is this safe?

Yes — a chronological split is all you need

No — the scaler's statistics include the test period, leaking future information; fit the scaler on the training data only

Yes, as long as you don't shuffle

It only matters for classification problems

Question (select one)

Why report a forecast's error from a multi-fold backtest rather than a single train/test split?

A single split is mathematically invalid

One split gives a single, noisy estimate that may have landed on an unusually easy or hard period; averaging several folds yields a more stable, trustworthy measure of skill

Backtesting lets you train on the test set

It makes the model itself more accurate

Question (select one)

Which metric would be the most problematic choice for evaluating forecasts of a series whose values regularly pass through and near zero (e.g., net hourly energy flow that can be positive, negative, or zero)?

MAE

RMSE

MAPE

They are all equally appropriate here

Question (select one)

You try 30 different (p, d, q) combinations and keep the one with the lowest error on your test set. What's the hidden problem?

Nothing — picking the best test score is the goal

The test set has now influenced model selection, so its score is optimistic; you need a separate validation set (or nested CV), and the final test set should be evaluated only once

30 combinations is too few to be reliable

ARIMA orders should never be compared

Key takeaways

A random / k-fold split leaks the future into training and reports fantasy accuracy. Time series demands chronological splits: train is the past, test is the future.
Backtest with walk-forward folds (expanding or rolling window) for a stable skill estimate — and in every fold, train strictly precedes test.
Code metrics by hand: MAE (robust, in-units), RMSE (punishes big misses), MAPE (scale-free percent, but breaks near zero).
Leakage hides in preprocessing too: fit scalers/imputers/decompositions on train only, avoid future-aware features (centered windows, leads, bfill/interpolate as inputs), and never tune on the test set.
Always compare to a baseline on the same honest backtest — skill is relative.

You now have every piece: shaping a timeline, reshaping its resolution, handling gaps, decomposing structure, achieving stationarity, reading ACF/PACF, fitting ARIMA, and — above all — validating honestly. The capstone threads them into one end-to-end pipeline.