Dataslope logoDataslope

Duplicates and Inconsistencies

Detecting and resolving the same logical thing recorded twice, three times, or under inconsistent names.

After missing values, the next universal mess is duplication — the same logical record appearing more than once. Closely related: inconsistency, where the same thing wears different labels.

What counts as a duplicate?

It depends on the grain. In a customers table, the same customer ID appearing twice is a duplicate. In an orders table, the same customer ID appearing twice is just a returning customer — not a duplicate.

So the question is always: what columns, taken together, should uniquely identify a row?

Detecting duplicates

Code Block
Python 3.13.2

By default, duplicated() marks the second and later copies as duplicates and leaves the first untouched. keep="last" reverses that; keep=False marks all duplicates.

Removing duplicates

drop_duplicates() is the corresponding action:

Code Block
Python 3.13.2

When you have almost duplicates that disagree on some columns (same customer, different spend in two rows), you have to make a business decision: keep the first, keep the last, sum the spend, average it? Pandas will not decide for you.

A safer pattern

When in doubt, aggregate instead of dropping:

df.groupby("customer_id").agg({"spend": "sum", "name": "first"})

This collapses duplicates while preserving the information you care about.

Inconsistencies in categorical text

Here is where most of the work happens. The same logical category can appear under many syntactic forms.

Code Block
Python 3.13.2

A canonical-mapping cleanup:

Code Block
Python 3.13.2

The pattern — strip → case-normalize → map known variants — covers most categorical cleanup.

Useful string helpers

Series.str exposes vectorized string methods (we will cover them in detail in the next chapter):

MethodDoes
.str.strip()Removes leading/trailing whitespace
.str.lower()Lowercases
.str.upper()Uppercases
.str.title()Title-cases
.str.replaceReplaces substrings (or regex)
.str.containsBoolean mask of substring matches
.str.startswithBoolean mask

Fuzzy duplicates (a peek)

Sometimes records are almost duplicates with subtle differences — "John Smith" vs "Jon Smith". Fully solving this needs string-similarity tools (Levenshtein distance, the rapidfuzz library, record-linkage techniques). For most beginner work, exact-match deduplication is enough; just be aware that fuzzy duplicates exist.

Inconsistent units

A column called price that holds USD in some rows, cents in others, and EUR in still others is a silent disaster. The fix is always: convert to one unit early, document the source unit, keep a column for the original currency if useful.

Code Block
Python 3.13.2

You will write code like this every time you receive data from a third party.

Duplicate index labels

A subtle one: a duplicated index can break alignment and joins. Check with:

Code Block
Python 3.13.2

If your index is supposed to be a unique identifier, assert it:

assert df.index.is_unique, "expected unique index"

A failed assertion at the top of a notebook is much friendlier than a baffling alignment bug forty cells later.

Check your understanding

QuestionSelect one

In a table where each row should represent one customer, two rows have the same customer_id but slightly different name and spend. What is the most defensible response?

Always drop the second copy

Always sum the spend

Decide based on the business rules (keep first? keep latest? aggregate?) and document the choice; do not blindly drop or sum

Ignore them

QuestionSelect one

Which sequence is the chapter's recommended general approach to cleaning inconsistent categorical labels?

Drop the column, recreate, paste back

One-hot encode everything

Strip whitespace → normalize case → replace known variants with a canonical form

Convert to numeric

QuestionSelect one

Why does a column with inconsistent units ("USD", "cents", "EUR" all in one price field) cause silent bugs?

It causes the file to fail to open

It is too small

Aggregations like mean, sum, comparisons, and rankings will mix incompatible numbers, producing meaningless results without any error

It uses too much memory

On this page