Missing Values
NaN, NaT, None — what they are, where they come from, how Pandas treats them, and the four strategies for dealing with them.
Missing data is the single most universal property of real-world datasets. This chapter is the playbook for handling it.
What "missing" looks like in Pandas
Pandas uses two sentinel values:
NaN(Not a Number) — for missing floats and, by extension, most "missing" cells.NaT(Not a Time) — for missing datetimes.
Plus Python's None which Pandas often converts to NaN.
Newer Pandas (1.x+) also offers nullable integer, nullable
boolean, and nullable string dtypes (with capital-letter
names — Int64, boolean, string) that use a richer sentinel
called pd.NA. For this course we stick with the classic
NaN/NaT.
Detecting missing values
The basics:
A useful one-liner for a dataset-wide audit:
df.isna().mean().sort_values(ascending=False)That gives you the fraction of missing values per column, sorted worst-first.
Where do NaNs come from?
Knowing the source guides the fix:
- Original data was incomplete. Survey skip, optional form field, sensor dropout.
- Sentinel values that were not declared. CSV uses
"N/A"or-1; the column got coerced anyway. - Coercion failures.
pd.to_numeric(..., errors="coerce")turns unparseable strings into NaN. - Join misses. Left-joining on a key not present in the right table fills the right's columns with NaN.
- Reshape artifacts. Pivoting a long table to wide creates NaN wherever the combination did not occur.
The remediation differs for each.
Strategy 1 — drop rows
dropna() removes rows with any missing value:
subset and how give you the fine control you need.
When to drop rows: when missingness is rare (a few percent), random, and the columns are central to your analysis.
When not to: when missingness is high or correlated with something interesting (e.g., income is missing more often for unemployed respondents — dropping them changes the population).
Strategy 2 — drop columns
If a column is mostly NaN, it usually does not help your analysis and may be noise.
Strategy 3 — impute
Fill in a plausible value. The simplest options:
For categorical columns, common choices are:
- The mode (most common value).
- A literal
"missing"or"unknown"label.
For numeric columns:
- The mean (sensitive to outliers).
- The median (robust).
- 0 — only if 0 is a meaningful "no contribution" value.
- A group-based mean (e.g. department's average) for sophisticated cases.
Imputation hides information
Filling NaNs with the mean lets downstream code run, but it also silently removes evidence of missingness. For most analyses this is fine; for some (where the fact of being missing matters), it is bad. Always ask whether the missingness itself is meaningful before imputing it away.
Strategy 4 — keep it explicit
Sometimes the right move is to leave the NaNs and design downstream code to handle them.
Combining strategies
A real-world cleaning step often mixes:
Detecting "fake" missing values
Some datasets use sentinel values that look like real data but mean "missing." Common offenders:
0,-1,9999,-9999for numbers."N/A","unknown","-",""for strings.1900-01-01or9999-12-31for dates.
You catch these by looking at the data. Once spotted, convert them to true NaN at load time:
df = pd.read_csv(path, na_values=["N/A", "-", "9999"])Or convert after:
df["age"] = df["age"].replace([-1, 999], np.nan)Mid-page check
A column age contains the values [25, -1, 30, 999, 35]. The values -1 and 999 are sentinel codes from the original system meaning "refused to answer" and "data error." What is the cleanest first step?
Take the mean
Drop the column
Replace the sentinels with NaN (df["age"].replace([-1, 999], np.nan)) so downstream operations treat them as missing
Round them to zero
Validate after cleaning
Whichever strategy you use, end with a quick sanity check:
If the count is not what you expected, fix it. The mistake is almost always a missed column or a sentinel you forgot about.
Check your understanding
When is dropping rows with NaN a poor strategy?
When the dataset is small
Always
When missingness is high or correlated with something interesting (e.g., income is missing more often for certain groups), because dropping changes the effective population
Never
What does s.fillna(s.median()) do?
Replaces NaNs with 0
Removes NaNs from s
Replaces NaNs with the median of the non-null values in s
Computes the median while ignoring NaNs
A categorical column payment_method has 8% missing values. You decide to fill them in. Which is most appropriate?
Fill with 0
Fill with the column's mean
Fill with the mode (most common value) or with a literal label like "unknown"
Fill with NaN
Why might you add a column like score_missing instead of (or in addition to) filling score?
It looks fancy
Pandas requires it
Because the fact that a value was missing can itself be informative — sometimes more informative than any imputed value
It saves memory