Filtering Data
Boolean masks — the single most important pattern in Pandas, used dozens of times per analysis.
If there is one Pandas pattern you will use more than any other, it is the boolean mask. Almost every interesting question you ask of a dataset starts with "first, give me the rows where..."
The basic shape
A boolean mask is a Series of True/False values, one per row.
Apply it to a DataFrame and you get back only the rows where the
mask is True.
Two-step or one-step — pick your style:
# Two-step (more readable for complex conditions)
high_earners = df["salary"] > 100
df[high_earners]
# One-step (more compact)
df[df["salary"] > 100]Comparison operators
All the comparisons return a boolean mask:
Combining conditions with &, |, ~
You must use & (and), | (or), ~ (not) — not the Python
keywords and, or, not. And always wrap each condition
in parentheses, because & has higher precedence than >.
If you forget the parentheses you will get a cryptic error about ambiguous truth values. The fix is always the same: add parentheses.
Useful helpers
A handful of methods produce boolean masks directly:
isin — membership
between — ranges
str.contains — substring matching
isna / notna — missing checks
query — a string-based DSL
Pandas has an alternative filter syntax that some people find more readable, especially for multi-condition filters:
Whether you prefer query or boolean masks is mostly a style
choice. Most codebases use masks; query shows up in notebooks.
A real-world filter
You will write filters like this all day, every day, as an
analyst. Cosmetic note: breaking each condition onto its own
line and aligning the &s makes complex filters readable.
Filter, then assign
A common pattern: filter to a subset, then add or modify a
column on just that subset. Use .loc for the assignment:
A filter challenge
Load:
https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv
Then build a DataFrame called young_low_pay containing only employees who:
- Are younger than 30 (
Age < 30), AND - Earn less than
MonthlyIncomeof 3000, AND - Did not leave the company (
Attrition == "No").
Keep all columns.
Check your understanding
Why must you use & instead of and when combining two boolean Series?
and does not exist in Python
It is shorter
and operates on single boolean values and tries to coerce a whole Series into one — which is ambiguous — while & is element-wise on Series
& is faster
Pandas hates English
Which of these is the cleanest way to filter for rows whose dept column is one of "Eng", "Sales", or "Ops"?
df[df["dept"] == "Eng" or df["dept"] == "Sales" or df["dept"] == "Ops"]
df[df["dept"] == ["Eng", "Sales", "Ops"]]
df[df["dept"].isin(["Eng", "Sales", "Ops"])]
df.dept = "Eng" | "Sales" | "Ops"
Why is df.loc[mask, "col"] = value preferred over df[mask]["col"] = value when modifying data?
The two are equivalent
The second is shorter
The chained form may silently operate on a copy and fail to update the original (the SettingWithCopyWarning); loc guarantees in-place update
The second is faster