Duplicates and Inconsistencies

Detecting and resolving the same logical thing recorded twice, three times, or under inconsistent names.

After missing values, the next universal mess is duplication — the same logical record appearing more than once. Closely related: inconsistency, where the same thing wears different labels.

What counts as a duplicate?

It depends on the grain. In a customers table, the same customer ID appearing twice is a duplicate. In an orders table, the same customer ID appearing twice is just a returning customer — not a duplicate.

So the question is always: what columns, taken together, should uniquely identify a row?

Detecting duplicates

By default, duplicated() marks the second and later copies as duplicates and leaves the first untouched. keep="last" reverses that; keep=False marks all duplicates.

Removing duplicates

drop_duplicates() is the corresponding action:

When you have almost duplicates that disagree on some columns (same customer, different spend in two rows), you have to make a business decision: keep the first, keep the last, sum the spend, average it? Pandas will not decide for you.

A safer pattern

When in doubt, aggregate instead of dropping:

df.groupby("customer_id").agg({"spend": "sum", "name": "first"})

This collapses duplicates while preserving the information you care about.

Inconsistencies in categorical text

Here is where most of the work happens. The same logical category can appear under many syntactic forms.

A canonical-mapping cleanup:

The pattern — strip → case-normalize → map known variants — covers most categorical cleanup.

Useful string helpers

Series.str exposes vectorized string methods (we will cover them in detail in the next chapter):

Method	Does
`.str.strip()`	Removes leading/trailing whitespace
`.str.lower()`	Lowercases
`.str.upper()`	Uppercases
`.str.title()`	Title-cases
`.str.replace`	Replaces substrings (or regex)
`.str.contains`	Boolean mask of substring matches
`.str.startswith`	Boolean mask

Fuzzy duplicates (a peek)

Sometimes records are almost duplicates with subtle differences — "John Smith" vs "Jon Smith". Fully solving this needs string-similarity tools (Levenshtein distance, the rapidfuzz library, record-linkage techniques). For most beginner work, exact-match deduplication is enough; just be aware that fuzzy duplicates exist.

Inconsistent units

A column called price that holds USD in some rows, cents in others, and EUR in still others is a silent disaster. The fix is always: convert to one unit early, document the source unit, keep a column for the original currency if useful.

You will write code like this every time you receive data from a third party.

Duplicate index labels

A subtle one: a duplicated index can break alignment and joins. Check with:

If your index is supposed to be a unique identifier, assert it:

assert df.index.is_unique, "expected unique index"

A failed assertion at the top of a notebook is much friendlier than a baffling alignment bug forty cells later.

Check your understanding

QuestionSelect one

In a table where each row should represent one customer, two rows have the same customer_id but slightly different name and spend. What is the most defensible response?

Always drop the second copy

Always sum the spend

Decide based on the business rules (keep first? keep latest? aggregate?) and document the choice; do not blindly drop or sum

Ignore them

QuestionSelect one

Which sequence is the chapter's recommended general approach to cleaning inconsistent categorical labels?

Drop the column, recreate, paste back

One-hot encode everything

Strip whitespace → normalize case → replace known variants with a canonical form

Convert to numeric

QuestionSelect one

Why does a column with inconsistent units ("USD", "cents", "EUR" all in one price field) cause silent bugs?

It causes the file to fail to open

It is too small

Aggregations like mean, sum, comparisons, and rankings will mix incompatible numbers, producing meaningless results without any error

It uses too much memory

Missing Values

NaN, NaT, None — what they are, where they come from, how Pandas treats them, and the four strategies for dealing with them.

String Operations

The .str accessor — vectorized string methods that operate on whole columns of text without writing a loop.

Duplicates and Inconsistencies

On this page