Duplicates and Inconsistencies
Detecting and resolving the same logical thing recorded twice, three times, or under inconsistent names.
After missing values, the next universal mess is duplication — the same logical record appearing more than once. Closely related: inconsistency, where the same thing wears different labels.
What counts as a duplicate?
It depends on the grain. In a customers table, the same customer ID appearing twice is a duplicate. In an orders table, the same customer ID appearing twice is just a returning customer — not a duplicate.
So the question is always: what columns, taken together, should uniquely identify a row?
Detecting duplicates
By default, duplicated() marks the second and later copies
as duplicates and leaves the first untouched. keep="last"
reverses that; keep=False marks all duplicates.
Removing duplicates
drop_duplicates() is the corresponding action:
When you have almost duplicates that disagree on some columns (same customer, different spend in two rows), you have to make a business decision: keep the first, keep the last, sum the spend, average it? Pandas will not decide for you.
A safer pattern
When in doubt, aggregate instead of dropping:
df.groupby("customer_id").agg({"spend": "sum", "name": "first"})
This collapses duplicates while preserving the information you care about.
Inconsistencies in categorical text
Here is where most of the work happens. The same logical category can appear under many syntactic forms.
A canonical-mapping cleanup:
The pattern — strip → case-normalize → map known variants — covers most categorical cleanup.
Useful string helpers
Series.str exposes vectorized string methods (we will cover
them in detail in the next chapter):
| Method | Does |
|---|---|
.str.strip() | Removes leading/trailing whitespace |
.str.lower() | Lowercases |
.str.upper() | Uppercases |
.str.title() | Title-cases |
.str.replace | Replaces substrings (or regex) |
.str.contains | Boolean mask of substring matches |
.str.startswith | Boolean mask |
Fuzzy duplicates (a peek)
Sometimes records are almost duplicates with subtle
differences — "John Smith" vs "Jon Smith". Fully solving
this needs string-similarity tools (Levenshtein distance, the
rapidfuzz library, record-linkage techniques). For most
beginner work, exact-match deduplication is enough; just be
aware that fuzzy duplicates exist.
Inconsistent units
A column called price that holds USD in some rows, cents in
others, and EUR in still others is a silent disaster. The fix is
always: convert to one unit early, document the source unit,
keep a column for the original currency if useful.
You will write code like this every time you receive data from a third party.
Duplicate index labels
A subtle one: a duplicated index can break alignment and joins. Check with:
If your index is supposed to be a unique identifier, assert it:
assert df.index.is_unique, "expected unique index"A failed assertion at the top of a notebook is much friendlier than a baffling alignment bug forty cells later.
Check your understanding
In a table where each row should represent one customer, two rows have the same customer_id but slightly different name and spend. What is the most defensible response?
Always drop the second copy
Always sum the spend
Decide based on the business rules (keep first? keep latest? aggregate?) and document the choice; do not blindly drop or sum
Ignore them
Which sequence is the chapter's recommended general approach to cleaning inconsistent categorical labels?
Drop the column, recreate, paste back
One-hot encode everything
Strip whitespace → normalize case → replace known variants with a canonical form
Convert to numeric
Why does a column with inconsistent units ("USD", "cents", "EUR" all in one price field) cause silent bugs?
It causes the file to fail to open
It is too small
Aggregations like mean, sum, comparisons, and rankings will mix incompatible numbers, producing meaningless results without any error
It uses too much memory