
Filtering Before Plotting

Why slicing your data is half the chart, and how to do it cleanly

Plotly Express is happy to draw whatever you pass it — including 2 million rows of irrelevant data. A lot of "bad" charts are really "right chart of the wrong slice of data." Spending 30 seconds filtering your DataFrame before the plot call is one of the highest-leverage habits in visualization work.

This page is about the pre-chart step: choosing what to show.

Why filter at all?

Most analytic questions are about a subset of the data:

  • "How are sales in Europe trending?"
  • "What did our active users do last week?"
  • "Show me the top 10 products."

If you skip the filter, the chart drowns in noise. Worse, the chart may technically be correct but answer the wrong question.

The big four operations

Four pandas operations cover almost every pre-chart need:

1. Boolean filter

df_2007 = df[df["year"] == 2007]
df_eu   = df[df["continent"] == "Europe"]
df_both = df[(df["year"] == 2007) & (df["continent"] == "Europe")]

Or the equivalent .query() form:

df.query("year == 2007 and continent == 'Europe'")
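To see that the two forms select the same rows, here is a minimal sketch on a tiny, made-up frame. The country values are illustrative assumptions; only the column names come from the snippets above.

```python
import pandas as pd

# Tiny made-up frame reusing the column names from the snippets above.
df = pd.DataFrame({
    "country": ["France", "Japan", "France", "Japan"],
    "continent": ["Europe", "Asia", "Europe", "Asia"],
    "year": [2002, 2002, 2007, 2007],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (df["year"] == 2007) & (df["continent"] == "Europe")
by_mask = df[mask]

# The same condition written as a query string.
by_query = df.query("year == 2007 and continent == 'Europe'")

# Both approaches return the same rows with the same index.
same = by_mask.equals(by_query)
```

Naming the mask, as done here, also makes it easy to reuse or invert (~mask) later.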

2. Top-N

top10 = df.sort_values("revenue", ascending=False).head(10)

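Almost any "top N" chart is just a sort + head + bar chart, provided the frame already has one row per item. If it doesn't, aggregate first (operation 3) and then take the top N.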
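As a quick illustration, the sketch below takes a top 3 from a made-up product table (names and revenue figures are assumptions) and shows that DataFrame.nlargest is a shorthand for the sort + head pattern.

```python
import pandas as pd

# Made-up revenue figures, one row per product.
products = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "revenue": [120, 340, 90, 510, 275],
})

# Sort descending, then keep the first three rows.
top3_sorted = products.sort_values("revenue", ascending=False).head(3)

# nlargest does the same thing in one call.
top3_nlargest = products.nlargest(3, "revenue")
```

Either result can go straight into a bar chart; the rows are already ordered from largest to smallest.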

3. Group + aggregate

avg_by_region = (
    df.groupby("region", as_index=False)["sales"].mean()
)

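groupby splits the rows into groups, and the aggregation function (here .mean()) says how to combine each group into a single row. as_index=False keeps the grouping column as a regular column rather than moving it into the index.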
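The following sketch runs that group + aggregate step on a small, made-up sales table (the numbers are assumptions), and also shows named aggregation for when more than one summary per group is wanted.

```python
import pandas as pd

# Made-up order-level data: several rows per region.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "sales": [100.0, 200.0, 50.0, 70.0, 90.0],
})

# One row per region, holding the mean of the "sales" column.
avg_by_region = sales.groupby("region", as_index=False)["sales"].mean()

# Named aggregation: several summaries per group, each with its own column name.
summary = sales.groupby("region", as_index=False).agg(
    total_sales=("sales", "sum"),
    n_orders=("sales", "count"),
)
```

Five input rows become two output rows, one per region, which is the shape a bar chart of regions needs.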

4. Date range

df_2024 = df[df["date"].between("2024-01-01", "2024-12-31")]

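For time series, almost every chart starts with a date filter. This assumes the date column has a datetime dtype. Note that .between() is inclusive at both ends and a bare date string means midnight, so if the column carries times of day, rows later on 2024-12-31 are dropped; using df["date"] < "2025-01-01" as the upper bound avoids that.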
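One detail worth checking when the date column includes times of day: the sketch below, on made-up timestamps, compares .between() with a half-open range. The frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up timestamps; the third one falls on the last day of 2024, in the evening.
events = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-12-31 23:00",
        "2024-01-01 00:00",
        "2024-12-31 18:30",
        "2025-01-01 00:00",
    ]),
    "value": [1, 2, 3, 4],
})

# Inclusive on both ends, but "2024-12-31" means 2024-12-31 00:00:00,
# so the 18:30 row is left out.
closed = events[events["date"].between("2024-01-01", "2024-12-31")]

# Half-open range: from the start of 2024 up to, but not including, the start of 2025.
half_open = events[(events["date"] >= "2024-01-01") & (events["date"] < "2025-01-01")]
```

If the column holds plain dates with no time component, both forms select the same rows.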

Example: from raw data to a clean chart

Let's build a story together: "Which European countries had the biggest gain in life expectancy between 1952 and 2007?"


Notice how much of the work is before the px.bar line. The chart itself is one line; the data preparation is four. That's the normal ratio.
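A minimal sketch of how such a pipeline could look is given below. It assumes a frame with the columns country, continent, year, and lifeExp (the layout of the gapminder sample that ships with Plotly Express); the values are made up for illustration, and the helper names life_expectancy_gains and plot_gains are hypothetical.

```python
import pandas as pd

# Made-up values in an assumed gapminder-style layout.
df = pd.DataFrame({
    "country": ["Albania", "Albania", "Albania", "Austria", "Austria", "Japan", "Japan"],
    "continent": ["Europe", "Europe", "Europe", "Europe", "Europe", "Asia", "Asia"],
    "year": [1952, 1977, 2007, 1952, 2007, 1952, 2007],
    "lifeExp": [55.0, 69.0, 76.0, 67.0, 80.0, 63.0, 82.0],
})


def life_expectancy_gains(df, continent="Europe", start=1952, end=2007):
    """Return one row per country with its lifeExp gain between start and end."""
    # 1. Filter: keep one continent and only the two years being compared.
    subset = df[(df["continent"] == continent) & df["year"].isin([start, end])]
    # 2. Reshape: one row per country, one column per year.
    wide = subset.pivot(index="country", columns="year", values="lifeExp")
    # 3. Compute the gain and sort so the biggest gain comes first.
    gain = (wide[end] - wide[start]).rename("gain").sort_values(ascending=False)
    # 4. Back to a flat frame with "country" and "gain" columns.
    return gain.reset_index()


def plot_gains(gains):
    """Draw the prepared frame as a horizontal bar chart (requires Plotly)."""
    import plotly.express as px  # imported here so the data prep runs without Plotly
    return px.bar(gains, x="gain", y="country", orientation="h")


gains = life_expectancy_gains(df)
```

Calling plot_gains(gains).show() would render the chart. The point of the split is that everything interesting happens in life_expectancy_gains, and the plotting function only has to name two columns.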

Common filter mistakes

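  • Forgetting to copy. When you do df_eu = df[df["continent"] == "Europe"] and then add or modify columns on df_eu, pandas may emit a SettingWithCopyWarning (depending on the version), because it cannot tell whether you meant to change df as well. Use df_eu = df[...].copy() if you plan to add columns.
  • Filtering rows after aggregating. Apply row-level filters (a year, a region, a segment) first, then aggregate. Once rows are collapsed into groups, you can't recover the ones you wanted. Filters on the aggregated result itself, such as keeping the top 10 groups, naturally come afterwards.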
  • Filtering with == vs .isin(). For multiple values, use df[df["continent"].isin(["Europe", "Asia"])], not chained ORs.
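The first and last points can be sketched together on a small, made-up frame (the values and the above_80 column are illustrative assumptions):

```python
import pandas as pd

# Made-up values; column names follow the examples on this page.
df = pd.DataFrame({
    "country": ["France", "Japan", "Kenya", "Spain"],
    "continent": ["Europe", "Asia", "Africa", "Europe"],
    "lifeExp": [80.7, 82.6, 54.1, 80.9],
})

# .copy() makes the subset an independent frame, so adding a column
# affects df_eu only and leaves df untouched.
df_eu = df[df["continent"] == "Europe"].copy()
df_eu["above_80"] = df_eu["lifeExp"] > 80

# .isin() replaces a chain of (== "Europe") | (== "Asia") comparisons.
df_eu_asia = df[df["continent"].isin(["Europe", "Asia"])]
```

The .isin() form also scales better: adding a third continent means adding one list element rather than another parenthesized comparison.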

The .query() style

.query("...") is often more readable for simple filters:

df.query("year == 2007 and continent == 'Europe' and lifeExp > 80")

is equivalent to:

df[(df["year"] == 2007) & (df["continent"] == "Europe") & (df["lifeExp"] > 80)]

Use whichever reads better in your context.
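One case where .query() tends to read better is when the thresholds live in Python variables: inside the query string, a name prefixed with @ refers to a variable in the surrounding scope. A small sketch, with made-up values and hypothetical variable names:

```python
import pandas as pd

# Made-up values; column names follow the examples above.
df = pd.DataFrame({
    "country": ["France", "Japan", "Spain", "France"],
    "continent": ["Europe", "Asia", "Europe", "Europe"],
    "year": [2007, 2007, 2007, 1952],
    "lifeExp": [80.7, 82.6, 80.9, 67.4],
})

target_year = 2007
min_life_exp = 80

# @name pulls in the Python variable instead of looking for a column called "name".
result = df.query(
    "year == @target_year and continent == 'Europe' and lifeExp > @min_life_exp"
)
```

The bracket form handles variables too, since it is ordinary Python; the @ syntax only matters inside query strings.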

Why "show less" usually beats "show more"

A chart of 200 countries is hard to read; a chart of the top 15 is easy. A chart of every day of the year is busy; a chart faceted by month is clear.

When you find yourself fighting a chart, the answer is often not to add more encodings — it's to remove rows. Filter first.

Check your understanding

Question (select one)

Why is filtering before plotting often more important than configuring the chart itself?

  • Filtering makes the chart render faster.
  • Filtering is required by Plotly.
  • A chart of the right subset is almost always more readable and more relevant than the same chart drawn on the entire dataset.
  • Filtering converts the DataFrame to a chart.

Question (select one)

Which pandas idiom keeps rows where the continent column is in ["Europe", "Asia"]?

  • df[df["continent"] == ["Europe", "Asia"]]
  • df[df["continent"] in ["Europe", "Asia"]]
  • df[df["continent"].isin(["Europe", "Asia"])]
  • df.filter(continent=["Europe", "Asia"])

Question (select one)

If your chart of every customer is unreadable, the best first move is usually to:

  • Add more colors.
  • Make the chart taller.
  • Filter to a meaningful subset (top N, recent dates, a specific segment) before plotting.
  • Switch to a 3-D chart.
