Messy Data Overview

A field guide to the kinds of mess you will find in real datasets, and the mindset that turns cleaning from a chore into a skill.

Textbook datasets are clean. Real datasets are not. This chapter is a map of the messy territory we will spend the next several pages exploring.

What "messy" actually means

A messy dataset suffers from one or more of:

Mess	Example
Missing values	NaN, empty strings, `"N/A"`, `"-"`, `0`-as-null
Duplicates	Same logical row recorded twice
Inconsistent categories	`"USA"`, `"U.S."`, `"United States"`
Inconsistent casing	`"Sales"`, `"sales"`, `"SALES"`
Typos	`"Engineering"`, `"Enginering"`
Wrong types	Numbers stored as strings, ZIPs as ints
Bad encodings	`"Cafe\u00e9"` printing as `"CafÃ©"`
Outliers / impossibles	Age of 200; salary of $999,999,999
Sentinel values	`-1`, `0`, or `9999` used to mean "missing"
Date format chaos	`"3/4/24"` (US? UK?), `"2024-03-04"`, etc.
Trailing whitespace	`" Sales "` vs `"Sales"`
Inconsistent units	Some prices in USD, some in cents, some in EUR
Grain confusion	One-row-per-customer mixed with one-row-per-order

Almost every real-world dataset has several of these. The analyst's job is to detect them, decide what to do, and document what you did.

A canonical workflow

The loop is the important part. Validation often reveals issues that detection missed, so cleaning is iterative.

Three classes of problem, three classes of response

A useful taxonomy:

1. Wrong data → fix it

If you can correct the value, do it: parse a date string, normalize a category, coerce a string-number to a number.

2. Missing data → decide

You usually have four choices:

Drop rows with missing values.
Drop the whole column if it is mostly missing.
Impute — fill in a plausible value (mean, median, 0, a flag, or a predicted value).
Keep as NaN and ensure downstream code handles it.

The right choice depends on why the value is missing and what the downstream use is. We will look at this in detail.

3. Suspect data → flag

When you cannot tell whether a value is wrong but it looks weird, flag it rather than silently fixing or dropping. A column like is_outlier lets reviewers see what you noticed.

The most important habit

Never modify the raw data file. Always keep your raw input untouched and write your cleaning as code that transforms raw → clean. If your cleaning was wrong, you can re-run it; if you overwrote the raw file, you cannot.

A taste of every flavor of mess

The following toy dataset has at least one example of nearly every problem in the table above. Read it carefully and try to spot each issue before reading on.

What did you spot?

Duplicate row (customer 5 appears twice).
Whitespace ("chen ").
Casing ("DIEGO", "uk").
Country inconsistency (USA / U.S.A. / United States / U.S.).
Date format chaos (ISO, slash, MM/DD).
An impossible date (Feb 29, 2024 is real, but Feb 30 is not — example of values that look fine but break).
spend as strings (because "abc" and "" poison the type inference).
spend outlier (9,999,999 is probably wrong).
spend negatives (refunds? data entry?).
Missing name (None).
Sentinel "??" in a date column.
Missing tier (None).

In other words: a single 9-row dataset already has roughly a dozen problems.

Walking through a small clean

A taste — we will not fix everything here, just demonstrate the pattern.

The remaining mess (missing names, suspicious spends) is now flagged and typed — we can decide what to do with each in later analysis steps.

The cleaning manifesto

Three principles to internalize:

Preserve the raw. Never overwrite the source file. Your raw → clean pipeline should be a function of the raw input.
Document every decision. Each cleaning step should have a comment explaining why. Not for show — for the next analyst, who is often you in three months.
Make every step a small testable transformation. Avoid one giant cleaning blob; break it into named steps so you can re-run from any point.

Check your understanding

QuestionSelect one

Why does the chapter recommend never overwriting your raw data file with the cleaned version?

It is a Pandas requirement

Raw files are bigger

If your cleaning is wrong, you can re-run from the untouched raw input; if you overwrote it, you cannot recover the original

Raw files are encrypted

QuestionSelect one

A column contains "USA", "U.S.A.", "United States", and "U.S." — all meaning the same country. What is the best one-word description?

Duplicate values

Missing values

Inconsistent categories

Wrong types

QuestionSelect one

What does the chapter mean by the "flag" approach to suspect data?

Removing the rows

Replacing the values with NaN

Keeping the original values but adding a boolean column that marks them as suspicious, so reviewers can see your judgment without losing information

Coloring the cells

GroupBy Operations

Split-apply-combine — the most powerful pattern in Pandas, and arguably in all of data analysis.

Missing Values

NaN, NaT, None — what they are, where they come from, how Pandas treats them, and the four strategies for dealing with them.

Messy Data Overview

On this page