Dataslope logoDataslope

GroupBy Operations

Split-apply-combine — the most powerful pattern in Pandas, and arguably in all of data analysis.

If there is one operation that earns Pandas its keep, it is groupby. Every "by department," "by month," "by user," "by category" question in data analysis ends up here.

Split-apply-combine

The mental model is three steps:

  • Split — partition rows into groups based on a column's values.
  • Apply — compute something on each group (a mean, a sum, a custom function).
  • Combine — stitch the per-group results back into one result.

Hadley Wickham named this pattern split-apply-combine in 2011, and it has been the conceptual backbone of tabular data analysis ever since.

The basic shape

Code Block
Python 3.13.2

Read it as: "From df, group by dept, pull out salary, take the mean of each group."

Multiple aggregations at once

Code Block
Python 3.13.2

Or apply different aggregations to different columns:

Code Block
Python 3.13.2

Named aggregations — much cleaner

The (column, function) tuple syntax produces flat, well-named columns. This is the modern recommended style.

Code Block
Python 3.13.2

This pattern — result_name=(column, function) — produces flat column names and self-documents the analysis. Use it.

Grouping by multiple columns

Pass a list of column names:

Code Block
Python 3.13.2

The result has a MultiIndex. To flatten back to a regular DataFrame, call .reset_index():

Code Block
Python 3.13.2

transform — broadcast a group statistic back to rows

Sometimes you do not want a smaller summary table — you want to add a column to the original rows containing a group statistic.

Code Block
Python 3.13.2

Each row now knows the average of its department, and how far it sits from that average. This is a hugely useful pattern for flagging outliers within groups.

filter — keep groups, not rows

A different kind of group operation: keep entire groups that satisfy a condition.

Code Block
Python 3.13.2

Note: filter here takes a function that receives the whole group as a DataFrame and returns True/False. Different from boolean-mask filtering of rows.

apply — for anything else

apply runs an arbitrary function on each group. Slower than the named aggregations, but flexible.

Code Block
Python 3.13.2

Reach for apply only when neither agg nor transform will do.

A realistic example

Compute, for the HR dataset, average monthly income and attrition rate by department.

Code Block
Python 3.13.2

This single chained call answers four business questions at once, per department. That economy is what makes groupby so beloved.

A multi-step challenge

Challenge
Python 3.13.2
Department attrition summary

Load:

https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv

Build a DataFrame called summary indexed by Department with exactly these columns (in this order):

  1. headcount — number of employees in the department
  2. avg_income — mean of MonthlyIncome, rounded to nearest integer
  3. attrition_pct — percentage of employees with Attrition == "Yes", rounded to 1 decimal place

Sort the result by attrition_pct descending.

Check your understanding

QuestionSelect one

The split-apply-combine pattern describes:

A coding style for indentation

A way to import data

The three steps of a groupby: partition rows into groups, compute something per group, stitch the results together

A method of sorting

QuestionSelect one

What is the practical advantage of agg(my_name=("col", "mean")) over agg({"col": ["mean"]})?

It is faster

It supports more functions

It produces flat, well-named output columns instead of a MultiIndex of columns — far easier to use downstream

It works on Series only

QuestionSelect one

When would you use groupby(...).transform(...) instead of .agg(...)?

When you want fewer rows

They are equivalent

When you want a column added back to every original row containing its group's statistic — the result has the same shape as the input

When sorting

QuestionSelect one

In a groupby(["a","b"]) result, what kind of index does the output usually have?

A flat RangeIndex

A column index

A MultiIndex — one level for column a, one level for column b

A DatetimeIndex

On this page