Merging and Joining
Inner, left, right, and outer joins — what they mean, when to use each, and how to debug surprises.
A join asks: given two tables and a shared key, what should
happen to rows that don't match? Pandas's merge answers this
with the how parameter.
The four join types
Setting the stage
Inner join
Only rows where the key is present in both. This is the default.
Esra (dept 99) and HR (dept 40) both vanish. Inner = strict intersection.
Left join
Keep all rows from the left side; fill missing right-side columns with NaN.
Esra is preserved with dept_name = NaN. Left joins are by far
the most common in everyday analytics — "I have my entities,
enrich them with extra info if available."
Right join
The mirror image. Rarely used in practice — most analysts prefer to swap the arguments and use a left join, which reads more naturally.
HR appears with NaN employee columns.
Outer join
Keep everything from both sides.
Both Esra (no department) and HR (no employees) are present. Great for spotting what doesn't match.
The indicator trick — diagnose join coverage
indicator=True adds a _merge column showing where each row
came from (left_only, right_only, both). This is one of
the most valuable debugging tools in Pandas — always reach for
it when a join produces surprising row counts.
Different column names
When the join key is named differently in each table:
Many-to-many — the row-count surprise
If the key is not unique on one side, the join multiplies rows. If it's not unique on both sides, you get the Cartesian product per key value — often a disaster.
Pandas can warn you about unexpected cardinalities:
validate= accepts "one_to_one", "one_to_many",
"many_to_one", or "many_to_many" and raises if the actual
data doesn't match.
Suffixes when column names collide
Always give meaningful suffixes — _x and _y are a recipe for
confusion six months later.
A bigger walkthrough
This is the everyday detective work of joining real datasets: first check coverage, then narrow down what's missing and why.
Mini challenge
Given orders and customers (provided), create a DataFrame orphans containing only the orders whose customer_id does not appear in the customers table.
Use a left-merge + indicator-based filter.
Check your understanding
Your join produces 3× more rows than the left table. The most likely cause is:
A bug in pandas
A network issue
The join key is duplicated in the right-hand table — each left row matches multiple right rows
The dtypes are different
Which join is best for "every employee row, with department info attached if available, but I don't want to lose any employees"?
inner
left (employees on the left)
outer
right
The most useful diagnostic flag for understanding a merge is:
copy=False
sort=True
indicator=True — adds a _merge column showing which rows are left-only, right-only, or matched in both
on="*"
What's the safest way to guarantee your merge is one-to-one before producing results?
Manually count rows after the merge
Hope for the best
Pass validate="one_to_one" to merge — Pandas raises an error if the data violates that assumption
Use a SQL database instead