Data Analysis with Python Pandas

History of Data Spreadsheets and Business Rise of Digital Datasets Birth of Data Science The Python and Pandas Story Spreadsheets vs Code

What Data Analysis Is How Computers See Data Rows and Columns The Analyst Mindset

Python for Analysis Notebooks and Environments

Loading Datasets First Look at a Dataset

DataFrames and Series Indexes and Labels Selecting Data loc vs iloc

Filtering Data Sorting and Ranking Creating New Columns

Aggregation Basics GroupBy Operations

Messy Data Overview Missing Values Duplicates and Inconsistencies String Operations Dates and Times

Concat vs Merge Merging and Joining

Wide vs Long Pivot Tables

The EDA Workflow Statistical Summaries Hypothesis Intuition

Visualization Basics Choosing the Right Chart

Reproducible Analysis Debugging Analysis Code Exporting Cleaned Data Project Organization

Merging and Joining

Inner, left, right, and outer joins — what they mean, when to use each, and how to debug surprises.

A join asks: given two tables and a shared key, what should happen to rows that don't match? Pandas's merge answers this with the how parameter.

The four join types

Setting the stage

Code Block

Python 3.13.2

Inner join

Only rows where the key is present in both. This is the default.

Code Block

Python 3.13.2

Esra (dept 99) and HR (dept 40) both vanish. Inner = strict intersection.

Left join

Keep all rows from the left side; fill missing right-side columns with NaN.

Code Block

Python 3.13.2

Esra is preserved with dept_name = NaN. Left joins are by far the most common in everyday analytics — "I have my entities, enrich them with extra info if available."

Right join

The mirror image. Rarely used in practice — most analysts prefer to swap the arguments and use a left join, which reads more naturally.

Code Block

Python 3.13.2

HR appears with NaN employee columns.

Outer join

Keep everything from both sides.

Code Block

Python 3.13.2

Both Esra (no department) and HR (no employees) are present. Great for spotting what doesn't match.

The indicator trick — diagnose join coverage

Code Block

Python 3.13.2

indicator=True adds a _merge column showing where each row came from (left_only, right_only, both). This is one of the most valuable debugging tools in Pandas — always reach for it when a join produces surprising row counts.

Different column names

When the join key is named differently in each table:

Code Block

Python 3.13.2

Many-to-many — the row-count surprise

If the key is not unique on one side, the join multiplies rows. If it's not unique on both sides, you get the Cartesian product per key value — often a disaster.

Code Block

Python 3.13.2

Pandas can warn you about unexpected cardinalities:

Code Block

Python 3.13.2

validate= accepts "one_to_one", "one_to_many", "many_to_one", or "many_to_many" and raises if the actual data doesn't match.

Suffixes when column names collide

Code Block

Python 3.13.2

Always give meaningful suffixes — _x and _y are a recipe for confusion six months later.

A bigger walkthrough

Code Block

Python 3.13.2

This is the everyday detective work of joining real datasets: first check coverage, then narrow down what's missing and why.

Mini challenge

Challenge

Python 3.13.2

Find unmatched orders

Given orders and customers (provided), create a DataFrame orphans containing only the orders whose customer_id does not appear in the customers table.

Use a left-merge + indicator-based filter.

Check your understanding

QuestionSelect one

Your join produces 3× more rows than the left table. The most likely cause is:

A bug in pandas

A network issue

The join key is duplicated in the right-hand table — each left row matches multiple right rows

The dtypes are different

QuestionSelect one

Which join is best for "every employee row, with department info attached if available, but I don't want to lose any employees"?

inner

left (employees on the left)

outer

right

QuestionSelect one

The most useful diagnostic flag for understanding a merge is:

copy=False

sort=True

indicator=True — adds a _merge column showing which rows are left-only, right-only, or matched in both

on="*"

QuestionSelect one

What's the safest way to guarantee your merge is one-to-one before producing results?

Manually count rows after the merge

Hope for the best

Pass validate="one_to_one" to merge — Pandas raises an error if the data violates that assumption

Use a SQL database instead

Concat vs Merge

Two ways to combine datasets — stacking and joining — and how to choose between them.

Wide vs Long

The shape of your data shapes the code you write. Long format unlocks Pandas's superpowers; wide format is easier to read.

On this page

The four join types Setting the stage Inner join Left join Right join Outer join The indicator trick — diagnose join coverage Different column names Many-to-many — the row-count surprise Suffixes when column names collide A bigger walkthrough Mini challenge Check your understanding