Concat vs Merge
Two ways to combine datasets — stacking and joining — and how to choose between them.
Two datasets rarely live alone. You'll often need to combine them. Pandas offers two core tools, and beginners often confuse them.
The mental model
- concat = glue tables together by position. No matching happens. "Here are three months of sales — stack them into one DataFrame."
- merge = bring columns from another table by matching a key. "For each employee, look up their department from the departments table."
Concat — when shapes match
ignore_index=True resets the integer index so it runs 0, 1, 2,
... instead of repeating each month's 0, 1.
You can also concat horizontally:
Horizontal concat is fast but fragile — it lines rows up by
position, not by any key. If the two frames were sorted
differently, you'd silently mis-pair people with names. Prefer
merge whenever a real key exists.
Merge — joining on a key
The result has every employee's row, with dept_name filled in
from the other table by matching dept_id.
A side-by-side example
Notice the natural workflow: concat first (stack the raw data sources), then merge in lookup tables to enrich.
When to use which
| Situation | Tool |
|---|---|
| Same columns, different time periods | concat (axis=0) |
| Same rows in same order, more columns to add | concat (axis=1) — but prefer merge if there's a key |
| Same row entities, columns from another table by key | merge |
| Combining 12 monthly files into one yearly file | concat |
| Adding customer details to each order | merge |
A common trap: misaligned indexes in concat
You get NaNs because the indexes don't overlap. Either reset
indexes first or use merge with an explicit key.
Check your understanding
You have twelve CSV files, one per month, each with the same columns. What's the right tool to combine them into one DataFrame?
merge
concat with axis=0 (stack rows) and ignore_index=True
pivot
join
You have an orders table with a customer_id column, and a separate customers table you want to look up the customer's name from. Which is correct?
concat axis=0
concat axis=1
orders.merge(customers, on="customer_id")
groupby
Why is horizontal pd.concat([a, b], axis=1) considered risky?
It is slow
It does not work with strings
It pairs rows by index/position rather than by a key — easy to silently mis-align rows if the two frames are in different orders
It is deprecated