Missing Values

Missing data is the single most universal property of real-world datasets. This chapter is the playbook for handling it.

What "missing" looks like in Pandas

Pandas uses two sentinel values:

NaN (Not a Number) — for missing floats and, by extension, most "missing" cells.
NaT (Not a Time) — for missing datetimes.

Plus Python's None which Pandas often converts to NaN.

Newer Pandas (1.x+) also offers nullable integer, nullable boolean, and nullable string dtypes (with capital-letter names — Int64, boolean, string) that use a richer sentinel called pd.NA. For this course we stick with the classic NaN/NaT.

Detecting missing values

The basics:

A useful one-liner for a dataset-wide audit:

df.isna().mean().sort_values(ascending=False)

That gives you the fraction of missing values per column, sorted worst-first.

Where do NaNs come from?

Knowing the source guides the fix:

Original data was incomplete. Survey skip, optional form field, sensor dropout.
Sentinel values that were not declared. CSV uses "N/A" or -1; the column got coerced anyway.
Coercion failures. pd.to_numeric(..., errors="coerce") turns unparseable strings into NaN.
Join misses. Left-joining on a key not present in the right table fills the right's columns with NaN.
Reshape artifacts. Pivoting a long table to wide creates NaN wherever the combination did not occur.

The remediation differs for each.

Strategy 1 — drop rows

dropna() removes rows with any missing value:

subset and how give you the fine control you need.

When to drop rows: when missingness is rare (a few percent), random, and the columns are central to your analysis.

When not to: when missingness is high or correlated with something interesting (e.g., income is missing more often for unemployed respondents — dropping them changes the population).

Strategy 2 — drop columns

If a column is mostly NaN, it usually does not help your analysis and may be noise.

Strategy 3 — impute

Fill in a plausible value. The simplest options:

For categorical columns, common choices are:

The mode (most common value).
A literal "missing" or "unknown" label.

For numeric columns:

The mean (sensitive to outliers).
The median (robust).
0 — only if 0 is a meaningful "no contribution" value.
A group-based mean (e.g. department's average) for sophisticated cases.

Imputation hides information

Filling NaNs with the mean lets downstream code run, but it also silently removes evidence of missingness. For most analyses this is fine; for some (where the fact of being missing matters), it is bad. Always ask whether the missingness itself is meaningful before imputing it away.

Strategy 4 — keep it explicit

Sometimes the right move is to leave the NaNs and design downstream code to handle them.

Combining strategies

A real-world cleaning step often mixes:

Detecting "fake" missing values

Some datasets use sentinel values that look like real data but mean "missing." Common offenders:

0, -1, 9999, -9999 for numbers.
"N/A", "unknown", "-", "" for strings.
1900-01-01 or 9999-12-31 for dates.

You catch these by looking at the data. Once spotted, convert them to true NaN at load time:

df = pd.read_csv(path, na_values=["N/A", "-", "9999"])

Or convert after:

df["age"] = df["age"].replace([-1, 999], np.nan)

Mid-page check

QuestionSelect one

A column age contains the values [25, -1, 30, 999, 35]. The values -1 and 999 are sentinel codes from the original system meaning "refused to answer" and "data error." What is the cleanest first step?

Take the mean

Drop the column

Replace the sentinels with NaN (df["age"].replace([-1, 999], np.nan)) so downstream operations treat them as missing

Round them to zero

Validate after cleaning

Whichever strategy you use, end with a quick sanity check:

If the count is not what you expected, fix it. The mistake is almost always a missed column or a sentinel you forgot about.

Check your understanding

QuestionSelect one

When is dropping rows with NaN a poor strategy?

When the dataset is small

Always

When missingness is high or correlated with something interesting (e.g., income is missing more often for certain groups), because dropping changes the effective population

Never

QuestionSelect one

What does s.fillna(s.median()) do?

Replaces NaNs with 0

Removes NaNs from s

Replaces NaNs with the median of the non-null values in s

Computes the median while ignoring NaNs

QuestionSelect one

A categorical column payment_method has 8% missing values. You decide to fill them in. Which is most appropriate?

Fill with 0

Fill with the column's mean

Fill with the mode (most common value) or with a literal label like "unknown"

Fill with NaN

QuestionSelect one

Why might you add a column like score_missing instead of (or in addition to) filling score?

It looks fancy

Pandas requires it

Because the fact that a value was missing can itself be informative — sometimes more informative than any imputed value

It saves memory

What "missing" looks like in Pandas

Detecting missing values

Where do NaNs come from?

Strategy 1 — drop rows

Strategy 2 — drop columns

Strategy 3 — impute

Strategy 4 — keep it explicit

Combining strategies

Detecting "fake" missing values

Mid-page check

Validate after cleaning

Check your understanding

Missing Values

On this page