Mind the Gap: Intelligently Handling Missing Temporal Values
Why you usually fill rather than drop temporal gaps, how to expose missing timestamps with reindex, and the three core fill assumptions — forward-fill, backward-fill, and linear interpolation — including which ones secretly peek at the future.
In an ordinary table, a missing value is often handled by deleting the row.
In a time series you usually can't: drop a row and you punch a hole in
the regular spacing that resample, rolling, and most models rely on.
Worse, with default settings a single NaN turns the result of every
rolling window that touches it into NaN. So the temporal
question is rarely "drop or keep?" but "what do I believe happened
during the gap?", and every fill method is a different answer to that
question.
Two kinds of 'missing' in time series
- A missing value at a present timestamp: the row exists, but the value is NaN (the sensor returned nothing at 14:00).
- A missing timestamp entirely: the row isn't even there (no record for 14:00 at all), so the gap is invisible until you put the series on a complete calendar.
The second is sneakier, because the series looks complete. Always check.
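One way to run that check is to compare the index against the calendar you expected. The sketch below assumes a daily series; the name readings and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings: Jan 2 is present but NaN, Jan 3 has no row at all.
readings = pd.Series(
    [21.0, np.nan, 22.5],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
)

# Kind 1: timestamps whose row exists but holds NaN.
nan_rows = readings.index[readings.isna()]

# Kind 2: timestamps a complete daily calendar expects but the index lacks.
expected = pd.date_range(readings.index.min(), readings.index.max(), freq="D")
absent_rows = expected.difference(readings.index)

print("NaN rows:   ", list(nan_rows))
print("Absent rows:", list(absent_rows))
```

Note that isna() alone would report only Jan 2 here; the absent Jan 3 shows up only once the index is compared with the expected calendar.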
Step 1: expose the gaps with reindex
A series can hide gaps simply by skipping timestamps. The fix is to
reindex onto a complete date_range, which turns every absent
timestamp into an explicit NaN you can see and deal with.
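A minimal sketch of that step, assuming a small daily series with made-up values in which Jan 3 and Jan 4 have no rows:

```python
import pandas as pd

# Hypothetical daily series: the index jumps from Jan 2 straight to Jan 5.
s = pd.Series(
    [10.0, 12.0, 18.0, 19.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]),
)

# Build the complete daily calendar from first to last timestamp,
# then reindex onto it so absent days become explicit NaN rows.
full_index = pd.date_range(s.index.min(), s.index.max(), freq="D")
exposed = s.reindex(full_index)

print(exposed)
print("explicit gaps:", exposed.isna().sum())
```

The reindexed series has six rows instead of four, and the two new rows hold NaN.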
Until you reindex, those two missing days are silent — averages and plots quietly pretend Jan 2 connects straight to Jan 5. Surfacing the gaps is the prerequisite for filling them honestly.
The three core fill strategies
Once gaps are explicit NaNs, you choose how to fill. The three workhorses
encode three different beliefs about the unseen interval:
The same gap — two missing points between a 10 and a 40 — gets three different answers. Let's compute exactly that:
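A small sketch of that comparison, using a four-day series whose dates are arbitrary placeholders:

```python
import numpy as np
import pandas as pd

# Two missing points between a known 10 and a known 40 (dates are placeholders).
gap = pd.Series(
    [10.0, np.nan, np.nan, 40.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

filled = pd.DataFrame({
    "ffill": gap.ffill(),                        # carry the 10 forward
    "bfill": gap.bfill(),                        # pull the 40 backward
    "linear": gap.interpolate(method="linear"),  # straight line from 10 to 40
})
print(filled)
```

The two gap rows come out as 10 and 10 under ffill, 40 and 40 under bfill, and 20 and 30 under linear interpolation.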
When each one is right
- Forward-fill (ffill): "the last value held until it next changed." Perfect for state / step signals such as a thermostat setting, a posted price, an account balance, or a configuration flag. These genuinely stay constant until an event changes them.
- Backward-fill (bfill): "the next known value applied during the gap." Useful for the leading edge (filling NaNs before the first real reading) or when a value is logged at the end of the period it describes.
- Linear interpolation: "the value glided smoothly between the two knowns." Ideal for continuous physical quantities sampled with dropouts: temperature, sensor voltage, a slowly drifting measurement.
The leakage twist: which fills peek at the future?
Here is the subtlety that separates a careful analyst from a sloppy one. Look again at what each method reads:
- ffill uses only the past (the last value before the gap). It is the only method here you can compute online, at the moment of the gap, without seeing the future.
- bfill reads the next value, which is in the future relative to the gap.
- Linear interpolation reads both the value before and the value after, so it also uses the future.
bfill and interpolation are future-aware
For historical analysis and charting, interpolation and back-fill are
perfectly fine — the whole series already exists, so using both neighbors
is legitimate. But if you are filling gaps in a feature that feeds a
forecast, bfill and interpolate smuggle the future into the present —
the same leakage sin as a centered rolling window or a random split. In an
online/forecasting setting, forward-fill is the safe default, because it
only ever looks backward.
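One way to see the leak is to fill the same timestamp twice: once with only the data available at that moment, and once with the full series in hindsight. The sketch below is an illustration with made-up values; a fill is leakage-free only if both answers agree.

```python
import numpy as np
import pandas as pd

full = pd.Series(
    [10.0, np.nan, np.nan, 40.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
jan2 = pd.Timestamp("2024-01-02")

# What a live system had seen by the end of Jan 2: the 40 had not arrived yet.
known_so_far = full.loc[:jan2]

# Forward-fill: same answer live and in hindsight, so nothing leaked.
ffill_live = known_so_far.ffill()[jan2]
ffill_hind = full.ffill()[jan2]

# Backward-fill: live there is nothing to pull back; in hindsight it reads the 40.
bfill_live = known_so_far.bfill()[jan2]
bfill_hind = full.bfill()[jan2]

# Linear interpolation: the hindsight value depends on the future 40.
interp_live = known_so_far.interpolate(method="linear")[jan2]
interp_hind = full.interpolate(method="linear")[jan2]

print("ffill :", ffill_live, ffill_hind)
print("bfill :", bfill_live, bfill_hind)
print("interp:", interp_live, interp_hind)
```

Only the ffill pair matches. The bfill and interpolation values for Jan 2 change once Jan 4 becomes visible, which is exactly the information a forecasting feature must not contain.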
You're building features to forecast tomorrow's value and need to fill an occasional missing sensor reading as the data streams in. Which fill method is safe to use, and why?
- Linear interpolation, because it's the most accurate
- Backward-fill, because it carries the correct next value
- Forward-fill, because it uses only the last known (past) value and never reads anything after the gap
- It doesn't matter; all fills use the same information
The domain question: is "missing" really zero?
Before any fill method, ask the most important question: does this gap mean "unknown," or does it mean "nothing happened"? They are completely different, and choosing wrong fabricates data.
No method substitutes for domain knowledge
Forward-fill, interpolation, and the rest are mechanical. They don't know whether your gap is an unrecorded measurement or a real zero. A closed shop, a sensor that's off, a holiday with no trading — these are zeros or genuine absences, not values to interpolate. Decide what the gap means first; only then pick a fill.
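To make the cost of a wrong assumption concrete, here is a sketch with hypothetical shop sales in which one day is missing because the shop was closed. The figures are invented for illustration.

```python
import pandas as pd

# Hypothetical daily sales; the shop was closed on Jan 3, so no row was recorded.
sales = pd.Series(
    [200.0, 220.0, 240.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
)
calendar = pd.date_range("2024-01-01", "2024-01-04", freq="D")

# Domain knowledge says "closed" means zero sales: fill with 0 while reindexing.
as_zero = sales.reindex(calendar, fill_value=0.0)

# Treating the same day as "unknown" and interpolating invents sales on a closed day.
as_unknown = sales.reindex(calendar).interpolate(method="linear")

print(as_zero.sum())     # 660.0
print(as_unknown.sum())  # 890.0
```

The interpolated version reports 230 of revenue for a day the shop never opened, and every total built on it inherits that fabrication.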
When dropping is actually fine
Filling isn't always required. dropna() is acceptable when gaps are
rare and scattered, you're computing an order-independent summary (like
an overall mean), and you won't subsequently rely on regular spacing. But
for anything that needs an unbroken timeline — resampling, rolling windows,
most models — fill rather than drop, so the calendar stays intact.
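The contrast can be sketched on a toy series (values are made up): the overall mean survives dropna() unchanged, while a row-based rolling window quietly starts spanning the hole.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [1.0, 2.0, np.nan, np.nan, 5.0, 6.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)
dropped = s.dropna()

# Order-independent summary: mean() skips NaN, so dropping changes nothing.
print(s.mean(), dropped.mean())  # 3.5 3.5

# Spacing is no longer regular: one step between rows now covers three days.
steps = dropped.index.to_series().diff().dropna()
print(steps.tolist())

# A 2-row rolling mean on the dropped series averages Jan 2 with Jan 5
# as if they were neighbours.
rolled = dropped.rolling(2).mean()
print(rolled)
```

Nothing in the rolling output warns that the Jan 5 value mixes observations three days apart, which is why dropping is best kept for summaries that ignore order and spacing.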
Why is deleting rows with dropna() often a poor choice for time series, even though it's common for cross-sectional data?
- It always changes the column dtypes
- It breaks the regular time spacing the series depends on, so resampling, rolling windows, and models that assume evenly spaced observations misbehave
- It's computationally too expensive
- pandas forbids dropping rows from a time series
Practice
A series status records a machine's power setting, logged only when it changes. Two calendar days have no row at all. Produce filled:
- Reindex status onto a complete daily range from its first to its last timestamp (exposing the missing days as NaN).
- Forward-fill the gaps, because a power setting holds until it's next changed.
The result should be a daily Series with no NaNs, where each missing day carries the most recent prior setting.
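One possible solution sketch follows. The exercise does not give the contents of status, so the series below is a hypothetical stand-in with numeric power levels and two missing calendar days.

```python
import pandas as pd

# Hypothetical stand-in for status: power level in percent, logged only on change.
status = pd.Series(
    [50.0, 100.0, 0.0],
    index=pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-05"]),
)

# Step 1: complete daily calendar from first to last timestamp.
full_days = pd.date_range(status.index.min(), status.index.max(), freq="D")

# Step 2: expose the missing days as NaN, then carry the last setting forward.
filled = status.reindex(full_days).ffill()
print(filled)
```

Here Mar 3 and Mar 4 both carry the 100.0 set on Mar 2. The two steps can also be combined as status.reindex(full_days, method="ffill"), which gives the same result when the original series has no NaN values of its own.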
A daily temp Series of temperatures has interior dropout days as NaN. Temperature varies continuously, so fill the interior gaps with linear interpolation. Produce smooth where every originally-interior NaN is replaced by the straight-line value between its known neighbours.
Then compute may4 — the interpolated value on 2014-05-04, which sits exactly halfway between the known May 3 (=20.0) and May 5 (=24.0), so it should be 22.0.
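A possible solution sketch, again on stand-in data: only the May 3 and May 5 values come from the exercise text, and the other readings are invented so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical temp series; May 3 = 20.0 and May 5 = 24.0 are from the exercise,
# the remaining values are placeholders.
temp = pd.Series(
    [18.0, 19.0, 20.0, np.nan, 24.0, 23.0],
    index=pd.date_range("2014-05-01", periods=6, freq="D"),
)

# limit_area="inside" restricts filling to NaNs with valid values on both sides,
# so any leading or trailing NaNs would be left untouched.
smooth = temp.interpolate(method="linear", limit_area="inside")

may4 = smooth[pd.Timestamp("2014-05-04")]
print(may4)  # 22.0
```

method="linear" treats rows as equally spaced and ignores the index values, which is fine on a regular daily calendar like this one. On an irregularly spaced index, method="time" weights by the actual time between observations instead.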
Check your understanding
A gap sits between a known value of 100 (Monday) and a known value of 160 (Thursday), with Tuesday and Wednesday missing. Match each method to the Tuesday/Wednesday pair it produces.
- ffill -> 120/140, bfill -> 100/100, interpolate -> 160/160
- ffill -> 100/100, bfill -> 160/160, interpolate -> 120/140
- ffill -> 160/160, bfill -> 100/100, interpolate -> 130/130
- All three give 130/130
A store is closed on public holidays, so those days have no sales rows. What's the right way to handle them before computing monthly totals?
- Linearly interpolate the holiday sales from the surrounding open days
- Forward-fill the previous day's sales onto the holiday
- Fill those days with 0, because "closed" means genuinely zero sales, not unknown sales
- Drop the holidays and ignore them
For historical analysis of an already-complete dataset (not a live forecast), is it acceptable to use linear interpolation to fill interior gaps in a continuously varying sensor signal?
- No, interpolation is never allowed
- Yes: on a complete historical series, using both neighbours is legitimate, and interpolation suits a continuously varying quantity
- Only if you also shuffle the series first
- Only for categorical data
Key takeaways
- In time series you usually fill gaps rather than drop rows, to preserve the regular spacing tooling depends on.
- Expose hidden gaps by reindexing onto a complete date_range; missing timestamps are invisible until you do.
- The three fills are three assumptions: ffill (value held), bfill (next value applied), linear interpolation (smooth glide).
- Leakage check: ffill looks only backward (safe for live features); bfill and interpolate read the future (fine for retrospective analysis, leakage for forecasting features).
- Ask whether a gap means "unknown" or "zero": a closed shop is a real 0, not a value to interpolate. Domain knowledge beats any method.
Your series is now clean, evenly spaced, and gap-free. That means we can finally take it apart — separating the trend, the seasonal cycle, and the leftover noise into pieces we can study one at a time.