Data Analysis with Python Pandas

History of Data Spreadsheets and Business Rise of Digital Datasets Birth of Data Science The Python and Pandas Story Spreadsheets vs Code

What Data Analysis Is How Computers See Data Rows and Columns The Analyst Mindset

Python for Analysis Notebooks and Environments

Loading Datasets First Look at a Dataset

DataFrames and Series Indexes and Labels Selecting Data loc vs iloc

Filtering Data Sorting and Ranking Creating New Columns

Aggregation Basics GroupBy Operations

Messy Data Overview Missing Values Duplicates and Inconsistencies String Operations Dates and Times

Concat vs Merge Merging and Joining

Wide vs Long Pivot Tables

The EDA Workflow Statistical Summaries Hypothesis Intuition

Visualization Basics Choosing the Right Chart

Reproducible Analysis Debugging Analysis Code Exporting Cleaned Data Project Organization

GroupBy Operations

Split-apply-combine — the most powerful pattern in Pandas, and arguably in all of data analysis.

If there is one operation that earns Pandas its keep, it is groupby. Every "by department," "by month," "by user," "by category" question in data analysis ends up here.

Split-apply-combine

The mental model is three steps:

Split — partition rows into groups based on a column's values.
Apply — compute something on each group (a mean, a sum, a custom function).
Combine — stitch the per-group results back into one result.

Hadley Wickham named this pattern split-apply-combine in 2011, and it has been the conceptual backbone of tabular data analysis ever since.

The basic shape

Code Block

Python 3.13.2

Read it as: "From df, group by dept, pull out salary, take the mean of each group."

Multiple aggregations at once

Code Block

Python 3.13.2

Or apply different aggregations to different columns:

Code Block

Python 3.13.2

Named aggregations — much cleaner

The (column, function) tuple syntax produces flat, well-named columns. This is the modern recommended style.

Code Block

Python 3.13.2

This pattern — result_name=(column, function) — produces flat column names and self-documents the analysis. Use it.

Grouping by multiple columns

Pass a list of column names:

Code Block

Python 3.13.2

The result has a MultiIndex. To flatten back to a regular DataFrame, call .reset_index():

Code Block

Python 3.13.2

`transform` — broadcast a group statistic back to rows

Sometimes you do not want a smaller summary table — you want to add a column to the original rows containing a group statistic.

Code Block

Python 3.13.2

Each row now knows the average of its department, and how far it sits from that average. This is a hugely useful pattern for flagging outliers within groups.

`filter` — keep groups, not rows

A different kind of group operation: keep entire groups that satisfy a condition.

Code Block

Python 3.13.2

Note: filter here takes a function that receives the whole group as a DataFrame and returns True/False. Different from boolean-mask filtering of rows.

`apply` — for anything else

apply runs an arbitrary function on each group. Slower than the named aggregations, but flexible.

Code Block

Python 3.13.2

Reach for apply only when neither agg nor transform will do.

A realistic example

Compute, for the HR dataset, average monthly income and attrition rate by department.

Code Block

Python 3.13.2

This single chained call answers four business questions at once, per department. That economy is what makes groupby so beloved.

A multi-step challenge

Challenge

Python 3.13.2

Department attrition summary

Load:

https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv

Build a DataFrame called summary indexed by Department with exactly these columns (in this order):

headcount — number of employees in the department
avg_income — mean of MonthlyIncome, rounded to nearest integer
attrition_pct — percentage of employees with Attrition == "Yes", rounded to 1 decimal place

Sort the result by attrition_pct descending.

Check your understanding

QuestionSelect one

The split-apply-combine pattern describes:

A coding style for indentation

A way to import data

The three steps of a groupby: partition rows into groups, compute something per group, stitch the results together

A method of sorting

QuestionSelect one

What is the practical advantage of agg(my_name=("col", "mean")) over agg({"col": ["mean"]})?

It is faster

It supports more functions

It produces flat, well-named output columns instead of a MultiIndex of columns — far easier to use downstream

It works on Series only

QuestionSelect one

When would you use groupby(...).transform(...) instead of .agg(...)?

When you want fewer rows

They are equivalent

When you want a column added back to every original row containing its group's statistic — the result has the same shape as the input

When sorting

QuestionSelect one

In a groupby(["a","b"]) result, what kind of index does the output usually have?

A flat RangeIndex

A column index

A MultiIndex — one level for column a, one level for column b

A DatetimeIndex

Aggregation Basics

Sum, mean, median, min, max — the simple reductions that turn many rows into one number, and the subtle choices that change what they mean.

Messy Data Overview

A field guide to the kinds of mess you will find in real datasets, and the mindset that turns cleaning from a chore into a skill.

On this page

Split-apply-combine The basic shape Multiple aggregations at once Named aggregations — much cleaner Grouping by multiple columns transform — broadcast a group statistic back to rows filter — keep groups, not rows apply — for anything else A realistic example A multi-step challenge Check your understanding