Dataslope logoDataslope

Filtering Data

Boolean masks — the single most important pattern in Pandas, used dozens of times per analysis.

If there is one Pandas pattern you will use more than any other, it is the boolean mask. Almost every interesting question you ask of a dataset starts with "first, give me the rows where..."

The basic shape

A boolean mask is a Series of True/False values, one per row. Apply it to a DataFrame and you get back only the rows where the mask is True.

Code Block
Python 3.13.2

Two-step or one-step — pick your style:

# Two-step (more readable for complex conditions)
high_earners = df["salary"] > 100
df[high_earners]

# One-step (more compact)
df[df["salary"] > 100]

Comparison operators

All the comparisons return a boolean mask:

Code Block
Python 3.13.2

Combining conditions with &, |, ~

You must use & (and), | (or), ~ (not) — not the Python keywords and, or, not. And always wrap each condition in parentheses, because & has higher precedence than >.

Code Block
Python 3.13.2

If you forget the parentheses you will get a cryptic error about ambiguous truth values. The fix is always the same: add parentheses.

Useful helpers

A handful of methods produce boolean masks directly:

isin — membership

Code Block
Python 3.13.2

between — ranges

Code Block
Python 3.13.2

str.contains — substring matching

Code Block
Python 3.13.2

isna / notna — missing checks

Code Block
Python 3.13.2

query — a string-based DSL

Pandas has an alternative filter syntax that some people find more readable, especially for multi-condition filters:

Code Block
Python 3.13.2

Whether you prefer query or boolean masks is mostly a style choice. Most codebases use masks; query shows up in notebooks.

A real-world filter

Code Block
Python 3.13.2

You will write filters like this all day, every day, as an analyst. Cosmetic note: breaking each condition onto its own line and aligning the &s makes complex filters readable.

Filter, then assign

A common pattern: filter to a subset, then add or modify a column on just that subset. Use .loc for the assignment:

Code Block
Python 3.13.2

A filter challenge

Challenge
Python 3.13.2
Filter the HR dataset

Load:

https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv

Then build a DataFrame called young_low_pay containing only employees who:

  • Are younger than 30 (Age < 30), AND
  • Earn less than MonthlyIncome of 3000, AND
  • Did not leave the company (Attrition == "No").

Keep all columns.

Check your understanding

QuestionSelect one

Why must you use & instead of and when combining two boolean Series?

and does not exist in Python

It is shorter

and operates on single boolean values and tries to coerce a whole Series into one — which is ambiguous — while & is element-wise on Series

& is faster

Pandas hates English

QuestionSelect one

Which of these is the cleanest way to filter for rows whose dept column is one of "Eng", "Sales", or "Ops"?

df[df["dept"] == "Eng" or df["dept"] == "Sales" or df["dept"] == "Ops"]

df[df["dept"] == ["Eng", "Sales", "Ops"]]

df[df["dept"].isin(["Eng", "Sales", "Ops"])]

df.dept = "Eng" | "Sales" | "Ops"

QuestionSelect one

Why is df.loc[mask, "col"] = value preferred over df[mask]["col"] = value when modifying data?

The two are equivalent

The second is shorter

The chained form may silently operate on a copy and fail to update the original (the SettingWithCopyWarning); loc guarantees in-place update

The second is faster

On this page