GroupBy Operations
Split-apply-combine — the most powerful pattern in Pandas, and arguably in all of data analysis.
If there is one operation that earns Pandas its keep, it is
groupby. Every "by department," "by month," "by user," "by
category" question in data analysis ends up here.
Split-apply-combine
The mental model is three steps:
- Split — partition rows into groups based on a column's values.
- Apply — compute something on each group (a mean, a sum, a custom function).
- Combine — stitch the per-group results back into one result.
Hadley Wickham named this pattern split-apply-combine in 2011, and it has been the conceptual backbone of tabular data analysis ever since.
The basic shape
Read it as: "From df, group by dept, pull out salary, take the mean of each group."
Multiple aggregations at once
Or apply different aggregations to different columns:
Named aggregations — much cleaner
The (column, function) tuple syntax produces flat, well-named
columns. This is the modern recommended style.
This pattern — result_name=(column, function) — produces flat
column names and self-documents the analysis. Use it.
Grouping by multiple columns
Pass a list of column names:
The result has a MultiIndex. To flatten back to a regular
DataFrame, call .reset_index():
transform — broadcast a group statistic back to rows
Sometimes you do not want a smaller summary table — you want to add a column to the original rows containing a group statistic.
Each row now knows the average of its department, and how far it sits from that average. This is a hugely useful pattern for flagging outliers within groups.
filter — keep groups, not rows
A different kind of group operation: keep entire groups that satisfy a condition.
Note: filter here takes a function that receives the whole
group as a DataFrame and returns True/False. Different from
boolean-mask filtering of rows.
apply — for anything else
apply runs an arbitrary function on each group. Slower than
the named aggregations, but flexible.
Reach for apply only when neither agg nor transform will
do.
A realistic example
Compute, for the HR dataset, average monthly income and attrition rate by department.
This single chained call answers four business questions at
once, per department. That economy is what makes groupby so
beloved.
A multi-step challenge
Load:
https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv
Build a DataFrame called summary indexed by Department with exactly these columns (in this order):
headcount— number of employees in the departmentavg_income— mean ofMonthlyIncome, rounded to nearest integerattrition_pct— percentage of employees withAttrition == "Yes", rounded to 1 decimal place
Sort the result by attrition_pct descending.
Check your understanding
The split-apply-combine pattern describes:
A coding style for indentation
A way to import data
The three steps of a groupby: partition rows into groups, compute something per group, stitch the results together
A method of sorting
What is the practical advantage of agg(my_name=("col", "mean")) over agg({"col": ["mean"]})?
It is faster
It supports more functions
It produces flat, well-named output columns instead of a MultiIndex of columns — far easier to use downstream
It works on Series only
When would you use groupby(...).transform(...) instead of .agg(...)?
When you want fewer rows
They are equivalent
When you want a column added back to every original row containing its group's statistic — the result has the same shape as the input
When sorting
In a groupby(["a","b"]) result, what kind of index does the output usually have?
A flat RangeIndex
A column index
A MultiIndex — one level for column a, one level for column b
A DatetimeIndex
Aggregation Basics
Sum, mean, median, min, max — the simple reductions that turn many rows into one number, and the subtle choices that change what they mean.
Messy Data Overview
A field guide to the kinds of mess you will find in real datasets, and the mindset that turns cleaning from a chore into a skill.