Filtering Data

Boolean masks — the single most important pattern in Pandas, used dozens of times per analysis.

If there is one Pandas pattern you will use more than any other, it is the boolean mask. Almost every interesting question you ask of a dataset starts with "first, give me the rows where..."

The basic shape

A boolean mask is a Series of True/False values, one per row. Apply it to a DataFrame and you get back only the rows where the mask is True.

Two-step or one-step — pick your style:

# Two-step (more readable for complex conditions)
high_earners = df["salary"] > 100
df[high_earners]

# One-step (more compact)
df[df["salary"] > 100]

Comparison operators

All the comparisons return a boolean mask:

Combining conditions with `&`, `|`, `~`

You must use & (and), | (or), ~ (not) — not the Python keywords and, or, not. And always wrap each condition in parentheses, because & has higher precedence than >.

If you forget the parentheses you will get a cryptic error about ambiguous truth values. The fix is always the same: add parentheses.

Useful helpers

A handful of methods produce boolean masks directly:

`isin` — membership

`between` — ranges

`str.contains` — substring matching

`isna` / `notna` — missing checks

`query` — a string-based DSL

Pandas has an alternative filter syntax that some people find more readable, especially for multi-condition filters:

Whether you prefer query or boolean masks is mostly a style choice. Most codebases use masks; query shows up in notebooks.

A real-world filter

You will write filters like this all day, every day, as an analyst. Cosmetic note: breaking each condition onto its own line and aligning the &s makes complex filters readable.

Filter, then assign

A common pattern: filter to a subset, then add or modify a column on just that subset. Use .loc for the assignment:

A filter challenge

Load:

https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv

Then build a DataFrame called young_low_pay containing only employees who:

Are younger than 30 (Age < 30), AND
Earn less than MonthlyIncome of 3000, AND
Did not leave the company (Attrition == "No").

Keep all columns.

Check your understanding

QuestionSelect one

Why must you use & instead of and when combining two boolean Series?

and does not exist in Python

It is shorter

and operates on single boolean values and tries to coerce a whole Series into one — which is ambiguous — while & is element-wise on Series

& is faster

Pandas hates English

QuestionSelect one

Which of these is the cleanest way to filter for rows whose dept column is one of "Eng", "Sales", or "Ops"?

df[df["dept"] == "Eng" or df["dept"] == "Sales" or df["dept"] == "Ops"]

df[df["dept"] == ["Eng", "Sales", "Ops"]]

df[df["dept"].isin(["Eng", "Sales", "Ops"])]

df.dept = "Eng" | "Sales" | "Ops"

QuestionSelect one

Why is df.loc[mask, "col"] = value preferred over df[mask]["col"] = value when modifying data?

The two are equivalent

The second is shorter

The chained form may silently operate on a copy and fail to update the original (the SettingWithCopyWarning); loc guarantees in-place update

The second is faster

loc vs iloc

The single most asked Pandas question on the internet, finally settled with side-by-side examples.

Sorting and Ranking

Ordering rows by one or more columns, and assigning ranks within groups.

The basic shape Comparison operators Combining conditions with &, |, ~Useful helpers isin — membership between — ranges str.contains — substring matching isna / notna — missing checks query — a string-based DSL A real-world filter Filter, then assign A filter challenge Check your understanding

Filtering Data

On this page