Resampling and Aggregation: Changing the Temporal Resolution
Downsampling versus upsampling — using resample() to summarize a series to a coarser grid (and why sum vs mean matters) or to stretch it onto a finer grid (and why that invents data rather than discovering it).
Real time series rarely arrive at the resolution you want to analyze them at. Sensors log every second but you reason in hours; sales come in daily but the business plans monthly. Resampling is how you change the temporal resolution of a series — and it splits cleanly into two operations with very different risks.
- Downsampling goes from fine to coarse (daily → monthly). You have more data than you need and must summarize it. The only question is which summary: sum? mean? last value?
- Upsampling goes from coarse to fine (monthly → daily). You have fewer points than the target grid and must invent the in-between values. This does not create information — it encodes an assumption.
resample is groupby for time
df.resample("ME") is the time-aware twin of df.groupby(...). It slices
the timeline into regular buckets ("every month," "every week") and then
waits for you to tell it how to aggregate each bucket. The frequency string
("D", "W", "ME") names the bucket size; the method (.sum(), .mean())
names the summary. ("ME", month end, is the spelling from pandas 2.2 on;
earlier versions use "M".)
Downsampling: summarizing to a coarser grid
Let's build a daily series — website visits with a slow upward trend and a weekend dip — and roll it up.
One series, two resamples, two very different numbers — and they answer different questions:
- resample("W").sum() → "how many visits did we get each week?"
- resample("ME").mean() → "what did a typical day look like each month?"
Sum or mean? Match the aggregation to the quantity
This is the choice beginners get wrong most often. The rule:
- Sum when the values are flows / counts that accumulate: sales, visits, units sold, rainfall, energy consumed. Seven daily visit counts genuinely add up to a weekly total.
- Mean when the values are levels / rates that don't accumulate: temperature, price, CPU utilization, a stock's closing level. The monthly "temperature" is the average of its days, never their sum (summing 30 daily temperatures of 20°C into "600°C for the month" is nonsense).
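The rule above can be checked on two toy hourly series. Both are hypothetical and held constant on purpose, so the arithmetic is easy to verify by hand: energy use of 0.5 kWh per hour (a flow) and CPU utilization of 40% (a level).

```python
import pandas as pd

# Two days of hourly readings (made-up constants for easy checking).
idx = pd.date_range("2024-06-01", periods=48, freq="h")
energy_kwh = pd.Series(0.5, index=idx)    # flow: accumulates over time
cpu_pct = pd.Series(40.0, index=idx)      # level: does not accumulate

daily_energy = energy_kwh.resample("D").sum()   # 12.0 kWh per day
daily_cpu = cpu_pct.resample("D").mean()        # 40.0 % per day

# The wrong pairing runs without error but yields a meaningless number.
cpu_summed = cpu_pct.resample("D").sum()        # 960.0 "percent" per day

print(daily_energy.tolist(), daily_cpu.tolist(), cpu_summed.tolist())
```

pandas will not stop you from summing a level; only knowing what the quantity means tells you which call is right.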
You have hourly air-temperature readings and want one value per day. Which resampling aggregation is correct?
resample("D").sum()
resample("D").mean()
resample("D").count()
It doesn't matter; all aggregations give the same answer
More than one summary at once
You don't have to pick a single aggregation. .agg([...]) returns several
columns side by side — handy for a quick profile of each bucket.
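As a small sketch of that, here is .agg with a list of aggregation names on a hypothetical two-week series whose values are simply 0 through 13.

```python
import numpy as np
import pandas as pd

# Two full Monday-to-Sunday weeks; values 0..13 are placeholders.
idx = pd.date_range("2024-01-01", periods=14, freq="D")
s = pd.Series(np.arange(14), index=idx)

# One row per week, one column per summary.
profile = s.resample("W").agg(["sum", "mean", "min", "max"])
print(profile)
# week ending 2024-01-07: sum 21, mean 3.0, min 0, max 6
# week ending 2024-01-14: sum 70, mean 10.0, min 7, max 13
```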
And monthly data downsamples further still — our airline series rolls up to quarterly or yearly views that make the long-run trend obvious:
Downsampling is a seasonality filter
Notice the yearly resample above has no summer hump — averaging a full year folds the 12-month seasonal cycle into a single number. Downsampling to the seasonal period (or a multiple of it) is a quick, crude way to see the trend without the seasonality shouting over it. We'll do this far more carefully with decomposition soon.
Upsampling: stretching onto a finer grid
Upsampling is the riskier direction. Going monthly → daily, you're asking for ~30x more rows than you have data for. pandas inserts the new timestamps but leaves them empty — because it has no honest value to put there.
Those NaNs are pandas being honest: it knows the value on January 15th
1949, but it has no idea what happened on January 16th — that day was never
measured. To get numbers there you must assume something, and the
assumption you choose is a modeling decision (the entire subject of the
next page):
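Here is a minimal sketch of that behaviour on three illustrative monthly values, stamped on the first of each month (an assumption made for this example). It shows the empty grid first, then two possible fill assumptions.

```python
import pandas as pd

# Three monthly observations (illustrative numbers).
monthly = pd.Series(
    [112.0, 118.0, 132.0],
    index=pd.date_range("1949-01-01", periods=3, freq="MS"),
)

# Lay down a daily grid: only the three original timestamps carry values.
daily = monthly.resample("D").asfreq()
print(len(daily), int(daily.notna().sum()))   # 60 rows, 3 real values

# Assumption 1: the value holds until the next observation.
held = daily.ffill()

# Assumption 2: the value moves in a straight line between observations.
line = daily.interpolate()

day = pd.Timestamp("1949-01-16")
print(held.loc[day], line.loc[day])
```

The two fills disagree about January 16th, and nothing in the data can say which is closer to the truth: each one is an assumption, not a measurement.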
Upsampling does not create information
The most dangerous misconception about resampling: that upsampling reveals finer detail. It does not. A monthly series upsampled to daily has exactly as much information as it did before — one real number per month — now smeared across 30 invented slots. Every value between the real observations is a guess whose quality depends entirely on whether your fill assumption matches reality. Never report upsampled points as if they were measured.
A colleague upsamples a monthly revenue series to daily with linear interpolation and announces, "Now we have daily revenue figures." What's the problem?
Nothing — interpolation correctly recovers the daily values
Linear interpolation is too slow on large data
Upsampling invents the in-between values; the "daily" figures are interpolated assumptions, not measurements, and carry no new information
They should have used sum instead of interpolation
resample().asfreq() vs resample().mean(): a quick contrast
- Downsampling must aggregate (.sum(), .mean(), .last(), ...) because many points fall in each new bucket.
- Upsampling has at most one original point per new bucket, so there's nothing to aggregate: you use .asfreq() to lay down the grid, then a fill method to populate it.
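A short sketch of what happens if you aggregate while upsampling anyway, using three illustrative monthly values. One detail to watch: .sum() over an empty bucket returns 0 rather than NaN, which can pass for a real measurement.

```python
import pandas as pd

# Three monthly observations (illustrative numbers), first-of-month stamps.
monthly = pd.Series(
    [112.0, 118.0, 132.0],
    index=pd.date_range("1949-01-01", periods=3, freq="MS"),
)

up = monthly.resample("D")
as_grid = up.asfreq()   # empty days are NaN
as_mean = up.mean()     # empty days are NaN too: nothing to average
as_sum = up.sum()       # empty days become 0.0, which looks like data

day = pd.Timestamp("1949-01-02")   # a day that was never observed
print(as_grid.loc[day], as_mean.loc[day], as_sum.loc[day])   # nan nan 0.0
```

This is one reason .asfreq() is the safer choice when upsampling: it leaves the gaps visibly empty until you choose a fill.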
Which statement correctly pairs the direction with what you must supply?
Downsampling needs a fill method; upsampling needs an aggregation
Downsampling needs an aggregation (sum/mean/...); upsampling needs a fill method (ffill/interpolate/...)
Both directions need an aggregation
Neither needs anything; pandas picks automatically
Practice
A daily visits Series (90 days) is loaded. Build weekly_total: the total visits per week, with weeks ending on Sunday (the "W" default). The result should be a Series indexed by week-ending dates.
Then also compute busiest_week — the week-ending Timestamp with the highest total.
You have two daily series for June: rainfall_mm (millimetres of rain that fell each day, a FLOW) and humidity_pct (the day's average relative humidity, a LEVEL). Produce a one-row-per-month summary as a dict june with:
- "total_rain": June's total rainfall (the correct aggregation for a flow), as a float
- "avg_humidity": June's typical humidity (the correct aggregation for a level), as a float rounded to 1 decimal
Pick sum vs mean to match what each quantity means.
Check your understanding
How is df.resample("ME") most accurately described?
A way to delete rows to make the series shorter
A time-aware groupby that buckets the index into regular periods, awaiting an aggregation to summarize each bucket
A method that always returns one row per original row
A plotting function
A daily count of support tickets is resampled to monthly. Which aggregation gives "how many tickets did we handle that month," and why?
mean, because monthly figures should be averages
sum, because ticket counts are a flow that accumulates over the month
last, because only the final day matters
max, because the busiest day represents the month
Key takeaways
- Resampling changes temporal resolution; it's groupby for the time axis.
- Downsampling (fine → coarse) summarizes: choose sum for flows (sales, visits, rainfall) and mean (or min/max) for levels (temperature, price). Picking the wrong one produces nonsense.
- Upsampling (coarse → fine) invents the in-between values. asfreq lays down the grid as NaNs; you then fill with an assumption.
- Upsampling adds no information; never present interpolated points as measurements.
- Downsampling to (a multiple of) the seasonal period is a quick way to see the underlying trend.
Upsampling left us staring at a grid full of NaNs, and even real-world
data arrives with holes. How you fill those gaps is not a button-press but
a genuine assumption about what happened when you weren't looking — which
is exactly the next page.