Resampling and Aggregation: Changing the Temporal Resolution

Downsampling versus upsampling — using resample() to summarize a series to a coarser grid (and why sum vs mean matters) or to stretch it onto a finer grid (and why that invents data rather than discovering it).

Real time series rarely arrive at the resolution you want to analyze them at. Sensors log every second but you reason in hours; sales come in daily but the business plans monthly. Resampling is how you change the temporal resolution of a series — and it splits cleanly into two operations with very different risks.

  • Downsampling goes from fine to coarse (daily → monthly). You have more data than you need and must summarize it. The only question is which summary: sum? mean? last value?
  • Upsampling goes from coarse to fine (monthly → daily). You have fewer points than the target grid and must invent the in-between values. This does not create information — it encodes an assumption.

resample is groupby for time

df.resample("ME") is the time-aware twin of df.groupby(...). It slices the timeline into regular buckets ("every month," "every week") and then waits for you to tell it how to aggregate each bucket. The frequency string ("D", "W", "ME") names the bucket size; the method (.sum(), .mean()) names the summary.

Downsampling: summarizing to a coarser grid

Let's build a daily series — website visits with a slow upward trend and a weekend dip — and roll it up.

One series, two resamples, two very different numbers — and they answer different questions:

  • resample("W").sum() → "how many visits did we get each week?"
  • resample("ME").mean() → "what did a typical day look like each month?"

Sum or mean? Match the aggregation to the quantity

This is the choice beginners get wrong most often. The rule:

  • Sum when the values are flows / counts that accumulate: sales, visits, units sold, rainfall, energy consumed. Ten daily visit counts genuinely add up to a weekly total.
  • Mean when the values are levels / rates that don't accumulate: temperature, price, CPU utilization, a stock's closing level. The monthly "temperature" is the average of its days, never their sum (summing 30 daily temperatures of 20°C into "600°C for the month" is nonsense).

Question (select one)

You have hourly air-temperature readings and want one value per day. Which resampling aggregation is correct?

resample("D").sum()

resample("D").mean()

resample("D").count()

It doesn't matter; all aggregations give the same answer

More than one summary at once

You don't have to pick a single aggregation. .agg([...]) returns several columns side by side — handy for a quick profile of each bucket.
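
As a small illustration, the sketch below profiles each week of a hypothetical 28-day series (values 0 to 27, starting on a Monday) with four aggregations at once.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 28 days starting on a Monday, values 0..27.
idx = pd.date_range("2024-01-01", periods=28, freq="D")
visits = pd.Series(np.arange(28, dtype=float), index=idx, name="visits")

# One row per week, one column per aggregation.
weekly_profile = visits.resample("W").agg(["sum", "mean", "min", "max"])

print(weekly_profile)
```

The result is a DataFrame whose columns are named after the aggregations, so it can be plotted or filtered like any other table.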

And monthly data downsamples further still: our airline series rolls up to quarterly or yearly views that make the long-run trend obvious.

Downsampling is a seasonality filter

Notice the yearly resample above has no summer hump — averaging a full year folds the 12-month seasonal cycle into a single number. Downsampling to the seasonal period (or a multiple of it) is a quick, crude way to see the trend without the seasonality shouting over it. We'll do this far more carefully with decomposition soon.

Upsampling: stretching onto a finer grid

Upsampling is the riskier direction. Going monthly → daily, you're asking for ~30x more rows than you have data for. pandas inserts the new timestamps but leaves them empty — because it has no honest value to put there.
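
A minimal sketch of that behaviour, assuming a hypothetical three-point monthly series stamped on the 15th of each month (the values are made up, not airline figures):

```python
import pandas as pd

# Hypothetical monthly observations, one per month, stamped mid-month.
monthly = pd.Series(
    [100.0, 131.0, 103.0],
    index=pd.to_datetime(["1949-01-15", "1949-02-15", "1949-03-15"]),
)

# Lay down a daily grid between the first and last observation.
daily = monthly.resample("D").asfreq()

print(len(daily), "rows,", daily.notna().sum(), "of them observed")
print(daily.head())
```

Only the three original timestamps carry values; every other row on the daily grid is NaN.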

Those NaNs are pandas being honest: it knows the value on January 15th 1949, but it has no idea what happened on January 16th, because that day was never measured. To get numbers there you must assume something, and the assumption you choose is a modeling decision (the entire subject of the next page).
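
Two common assumptions can be compared side by side. This sketch reuses a hypothetical mid-month series (made-up values chosen so the arithmetic is easy to follow) and fills the daily grid once with forward fill and once with linear interpolation.

```python
import pandas as pd

# Hypothetical monthly observations, stamped on the 15th of each month.
monthly = pd.Series(
    [100.0, 131.0, 103.0],
    index=pd.to_datetime(["1949-01-15", "1949-02-15", "1949-03-15"]),
)
daily = monthly.resample("D").asfreq()

# Assumption A: the value holds until the next observation (a step function).
stepped = daily.ffill()

# Assumption B: the value moves in a straight line between observations.
straight = daily.interpolate(method="linear")

jan16 = pd.Timestamp("1949-01-16")
print("ffill:      ", stepped.loc[jan16])
print("interpolate:", straight.loc[jan16])
```

The two fills disagree about January 16th, and neither figure was measured. Which one is closer to the truth depends on how the underlying quantity actually behaves.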

Upsampling does not create information

The most dangerous misconception about resampling: that upsampling reveals finer detail. It does not. A monthly series upsampled to daily has exactly as much information as it did before — one real number per month — now smeared across 30 invented slots. Every value between the real observations is a guess whose quality depends entirely on whether your fill assumption matches reality. Never report upsampled points as if they were measured.

Question (select one)

A colleague upsamples a monthly revenue series to daily with linear interpolation and announces, "Now we have daily revenue figures." What's the problem?

Nothing — interpolation correctly recovers the daily values

Linear interpolation is too slow on large data

Upsampling invents the in-between values; the "daily" figures are interpolated assumptions, not measurements, and carry no new information

They should have used sum instead of interpolation

resample().asfreq() vs resample().mean(): a quick contrast

  • Downsampling must aggregate (.sum(), .mean(), .last(), ...) because many points fall in each new bucket.
  • Upsampling has at most one original point per new bucket, so there's nothing to aggregate — you use .asfreq() to lay down the grid, then a fill method to populate it.
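
The contrast can be sketched on a tiny hypothetical series with a one-day gap. It also shows a side effect worth knowing: with pandas' defaults, .sum() reports an empty bucket as 0 rather than NaN, which can look like a real measurement.

```python
import pandas as pd

# Hypothetical sparse series: two observations, two days apart.
sparse = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-03"]),
)

grid = sparse.resample("D").asfreq()   # the empty day stays NaN
as_mean = sparse.resample("D").mean()  # the empty day is also NaN
as_sum = sparse.resample("D").sum()    # the empty day becomes 0.0, not NaN

print(pd.DataFrame({"asfreq": grid, "mean": as_mean, "sum": as_sum}))
```

Using .asfreq() keeps the gap visible, so the fill step that follows is an explicit choice rather than an accident of the aggregation.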

Question (select one)

Which statement correctly pairs the direction with what you must supply?

Downsampling needs a fill method; upsampling needs an aggregation

Downsampling needs an aggregation (sum/mean/...); upsampling needs a fill method (ffill/interpolate/...)

Both directions need an aggregation

Neither needs anything; pandas picks automatically

Practice

Challenge: Daily visits to weekly totals

A daily visits Series (90 days) is loaded. Build weekly_total: the total visits per week, with weeks ending on Sunday (the "W" default). The result should be a Series indexed by week-ending dates.

Then also compute busiest_week — the week-ending Timestamp with the highest total.

Challenge: Choose the right aggregation for each series

You have two daily series for June: rainfall_mm (millimetres of rain that fell each day — a FLOW) and humidity_pct (the day's average relative humidity — a LEVEL). Produce a one-row-per-month summary as a dict june with:

  • "total_rain" — June's total rainfall (the correct aggregation for a flow), as a float
  • "avg_humidity" — June's typical humidity (the correct aggregation for a level), as a float rounded to 1 decimal

Pick sum vs mean to match what each quantity means.

Check your understanding

Question (select one)

How is df.resample("ME") most accurately described?

A way to delete rows to make the series shorter

A time-aware groupby that buckets the index into regular periods, awaiting an aggregation to summarize each bucket

A method that always returns one row per original row

A plotting function

Question (select one)

A daily count of support tickets is resampled to monthly. Which aggregation gives "how many tickets did we handle that month," and why?

mean, because monthly figures should be averages

sum, because ticket counts are a flow that accumulates over the month

last, because only the final day matters

max, because the busiest day represents the month

Key takeaways

  • Resampling changes temporal resolution; it's groupby for the time axis.
  • Downsampling (fine → coarse) summarizes: choose sum for flows (sales, visits, rainfall) and mean (or min/max) for levels (temperature, price). Picking the wrong one produces nonsense.
  • Upsampling (coarse → fine) invents the in-between values. asfreq lays down the grid as NaNs; you then fill with an assumption.
  • Upsampling adds no information — never present interpolated points as measurements.
  • Downsampling to (a multiple of) the seasonal period is a quick way to see the underlying trend.

Upsampling left us staring at a grid full of NaNs, and even real-world data arrives with holes. How you fill those gaps is not a button-press but a genuine assumption about what happened when you weren't looking — which is exactly the next page.
