Dataslope logoDataslope

Merging and Joining

Inner, left, right, and outer joins — what they mean, when to use each, and how to debug surprises.

A join asks: given two tables and a shared key, what should happen to rows that don't match? Pandas's merge answers this with the how parameter.

The four join types

Setting the stage

Code Block
Python 3.13.2

Inner join

Only rows where the key is present in both. This is the default.

Code Block
Python 3.13.2

Esra (dept 99) and HR (dept 40) both vanish. Inner = strict intersection.

Left join

Keep all rows from the left side; fill missing right-side columns with NaN.

Code Block
Python 3.13.2

Esra is preserved with dept_name = NaN. Left joins are by far the most common in everyday analytics — "I have my entities, enrich them with extra info if available."

Right join

The mirror image. Rarely used in practice — most analysts prefer to swap the arguments and use a left join, which reads more naturally.

Code Block
Python 3.13.2

HR appears with NaN employee columns.

Outer join

Keep everything from both sides.

Code Block
Python 3.13.2

Both Esra (no department) and HR (no employees) are present. Great for spotting what doesn't match.

The indicator trick — diagnose join coverage

Code Block
Python 3.13.2

indicator=True adds a _merge column showing where each row came from (left_only, right_only, both). This is one of the most valuable debugging tools in Pandas — always reach for it when a join produces surprising row counts.

Different column names

When the join key is named differently in each table:

Code Block
Python 3.13.2

Many-to-many — the row-count surprise

If the key is not unique on one side, the join multiplies rows. If it's not unique on both sides, you get the Cartesian product per key value — often a disaster.

Code Block
Python 3.13.2

Pandas can warn you about unexpected cardinalities:

Code Block
Python 3.13.2

validate= accepts "one_to_one", "one_to_many", "many_to_one", or "many_to_many" and raises if the actual data doesn't match.

Suffixes when column names collide

Code Block
Python 3.13.2

Always give meaningful suffixes — _x and _y are a recipe for confusion six months later.

A bigger walkthrough

Code Block
Python 3.13.2

This is the everyday detective work of joining real datasets: first check coverage, then narrow down what's missing and why.

Mini challenge

Challenge
Python 3.13.2
Find unmatched orders

Given orders and customers (provided), create a DataFrame orphans containing only the orders whose customer_id does not appear in the customers table.

Use a left-merge + indicator-based filter.

Check your understanding

QuestionSelect one

Your join produces 3× more rows than the left table. The most likely cause is:

A bug in pandas

A network issue

The join key is duplicated in the right-hand table — each left row matches multiple right rows

The dtypes are different

QuestionSelect one

Which join is best for "every employee row, with department info attached if available, but I don't want to lose any employees"?

inner

left (employees on the left)

outer

right

QuestionSelect one

The most useful diagnostic flag for understanding a merge is:

copy=False

sort=True

indicator=True — adds a _merge column showing which rows are left-only, right-only, or matched in both

on="*"

QuestionSelect one

What's the safest way to guarantee your merge is one-to-one before producing results?

Manually count rows after the merge

Hope for the best

Pass validate="one_to_one" to merge — Pandas raises an error if the data violates that assumption

Use a SQL database instead

On this page