Messy Data Overview
A field guide to the kinds of mess you will find in real datasets, and the mindset that turns cleaning from a chore into a skill.
Textbook datasets are clean. Real datasets are not. This chapter is a map of the messy territory we will spend the next several pages exploring.
What "messy" actually means
A messy dataset suffers from one or more of:
| Mess | Example |
|---|---|
| Missing values | NaN, empty strings, "N/A", "-", 0-as-null |
| Duplicates | Same logical row recorded twice |
| Inconsistent categories | "USA", "U.S.", "United States" |
| Inconsistent casing | "Sales", "sales", "SALES" |
| Typos | "Engineering", "Enginering" |
| Wrong types | Numbers stored as strings, ZIPs as ints |
| Bad encodings | "Cafe\u00e9" printing as "Café" |
| Outliers / impossibles | Age of 200; salary of $999,999,999 |
| Sentinel values | -1, 0, or 9999 used to mean "missing" |
| Date format chaos | "3/4/24" (US? UK?), "2024-03-04", etc. |
| Trailing whitespace | " Sales " vs "Sales" |
| Inconsistent units | Some prices in USD, some in cents, some in EUR |
| Grain confusion | One-row-per-customer mixed with one-row-per-order |
Almost every real-world dataset has several of these. The analyst's job is to detect them, decide what to do, and document what you did.
A canonical workflow
The loop is the important part. Validation often reveals issues that detection missed, so cleaning is iterative.
Three classes of problem, three classes of response
A useful taxonomy:
1. Wrong data → fix it
If you can correct the value, do it: parse a date string, normalize a category, coerce a string-number to a number.
2. Missing data → decide
You usually have four choices:
- Drop rows with missing values.
- Drop the whole column if it is mostly missing.
- Impute — fill in a plausible value (mean, median, 0, a flag, or a predicted value).
- Keep as NaN and ensure downstream code handles it.
The right choice depends on why the value is missing and what the downstream use is. We will look at this in detail.
3. Suspect data → flag
When you cannot tell whether a value is wrong but it looks
weird, flag it rather than silently fixing or dropping. A
column like is_outlier lets reviewers see what you noticed.
The most important habit
Never modify the raw data file. Always keep your raw input untouched and write your cleaning as code that transforms raw → clean. If your cleaning was wrong, you can re-run it; if you overwrote the raw file, you cannot.
A taste of every flavor of mess
The following toy dataset has at least one example of nearly every problem in the table above. Read it carefully and try to spot each issue before reading on.
What did you spot?
- Duplicate row (customer 5 appears twice).
- Whitespace ("chen ").
- Casing ("DIEGO", "uk").
- Country inconsistency (USA / U.S.A. / United States / U.S.).
- Date format chaos (ISO, slash, MM/DD).
- An impossible date (Feb 29, 2024 is real, but Feb 30 is not — example of values that look fine but break).
spendas strings (because"abc"and""poison the type inference).spendoutlier (9,999,999 is probably wrong).spendnegatives (refunds? data entry?).- Missing
name(None). - Sentinel
"??"in a date column. - Missing
tier(None).
In other words: a single 9-row dataset already has roughly a dozen problems.
Walking through a small clean
A taste — we will not fix everything here, just demonstrate the pattern.
The remaining mess (missing names, suspicious spends) is now flagged and typed — we can decide what to do with each in later analysis steps.
The cleaning manifesto
Three principles to internalize:
- Preserve the raw. Never overwrite the source file. Your raw → clean pipeline should be a function of the raw input.
- Document every decision. Each cleaning step should have a comment explaining why. Not for show — for the next analyst, who is often you in three months.
- Make every step a small testable transformation. Avoid one giant cleaning blob; break it into named steps so you can re-run from any point.
Check your understanding
Why does the chapter recommend never overwriting your raw data file with the cleaned version?
It is a Pandas requirement
Raw files are bigger
If your cleaning is wrong, you can re-run from the untouched raw input; if you overwrote it, you cannot recover the original
Raw files are encrypted
A column contains "USA", "U.S.A.", "United States", and "U.S." — all meaning the same country. What is the best one-word description?
Duplicate values
Missing values
Inconsistent categories
Wrong types
What does the chapter mean by the "flag" approach to suspect data?
Removing the rows
Replacing the values with NaN
Keeping the original values but adding a boolean column that marks them as suspicious, so reviewers can see your judgment without losing information
Coloring the cells