Dataslope

Loading Data with Pandas

The minimum pandas you need to feed Plotly Express — DataFrames, columns, and basic filtering

Plotly Express expects its data in a pandas DataFrame. If you've never used pandas before, that sentence might sound intimidating — but a DataFrame is just a table, and you only need to know a handful of operations before you can make every chart in this course.

This page is the just-enough pandas tutorial. We will not turn you into a pandas expert; we will get you fluent enough to feed Plotly Express.

What is a DataFrame?

A DataFrame is a table in memory. It has:

  • Rows — usually one per observation (one country, one transaction, one student).
  • Columns — named, typed fields (e.g., country, year, gdpPercap).
  • An index — row labels, usually just 0, 1, 2, ... by default.

If you've used Excel, a DataFrame is roughly "one worksheet, with named column headers."
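
As a minimal sketch of the idea, the snippet below builds a small DataFrame by hand and inspects it. The column names mirror the gapminder examples used in this course, but the three rows are made-up placeholder values, not real statistics.

```python
import pandas as pd

# Placeholder values for illustration only (not real gapminder figures).
df = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "year": [2007, 2007, 2007],
    "gdpPercap": [12000.0, 34000.0, 5600.0],
})

print(df)                # the whole table, with the default 0, 1, 2 index
print(list(df.columns))  # column names
print(len(df))           # number of rows
```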

That prints a nicely formatted table. df.columns lists the column names. len(df) gives the number of rows.

Three ways data gets into a DataFrame

For this course, you'll see data arrive in three ways:

  1. Built-in dataset. Plotly Express ships with several classic datasets:

    df = px.data.gapminder()    # life-expectancy & GDP
    df = px.data.iris()         # flower measurements
    df = px.data.tips()         # restaurant tipping
    df = px.data.stocks()       # daily stock prices
  2. Hand-built dictionary. Useful for tiny illustrative examples:

    df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
  3. A CSV (or Excel) file. The real-world case:

    df = pd.read_csv("sales.csv")

    In your own work this is the most common pattern.

For this course, we'll mostly use options 1 and 2, because no file uploads are needed in the browser runtime.
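
If you want to try option 3 without a file on disk, one possible approach is to hand read_csv an in-memory text buffer. The sketch below assumes a hypothetical three-row sales table; the column names and values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV contents; in real work this text would live in a file such as sales.csv.
csv_text = """region,month,sales
North,Jan,100
North,Feb,120
South,Jan,90
"""

# read_csv accepts a path or a file-like object, so StringIO stands in for a file here.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # sales is parsed as an integer column, the others as text
```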

Peeking at a DataFrame

When you receive a DataFrame, your first instinct should be to look at it. Pandas provides a few essential methods and attributes:
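
A short sketch of that first look, using a hand-built table shaped like the gapminder dataset. The six rows are placeholder values chosen only so the example runs on its own.

```python
import pandas as pd

# Made-up rows shaped like the gapminder dataset (placeholder values).
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "year": [2002, 2007, 2002, 2007, 2002, 2007],
    "lifeExp": [61.5, 63.0, 78.2, 79.1, 70.4, 71.8],
})

print(df.head())      # first 5 rows
print(df.head(2))     # first 2 rows
print(df.shape)       # (rows, columns) -> (6, 3)
print(df.columns)     # column labels
print(df.describe())  # summary statistics for the numeric columns
```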

Every analyst, every time they load a new dataset, runs some combination of df.head(), df.shape, df.columns, and df.describe(). Build the habit.

Selecting a column

To work with a single column, use square brackets:

df["lifeExp"]   # the lifeExp column, as a Series

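You can also use dot notation, df.lifeExp, if the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method (a column named count or shape, for example). Bracket notation always works, so prefer it.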
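
The sketch below shows what the two bracket forms return and one case where dot notation silently gives you something else. The data is made up, and the column named count is a hypothetical one added only to demonstrate the name clash.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "lifeExp": [61.5, 78.2, 70.4],
    "count": [10, 20, 30],  # hypothetical column whose name matches a DataFrame method
})

life = df["lifeExp"]                 # one name -> Series
subset = df[["country", "lifeExp"]]  # list of names -> DataFrame

print(type(life).__name__)    # Series
print(type(subset).__name__)  # DataFrame
print(life.mean())            # a Series has its own methods

# Dot notation finds the DataFrame.count method here, not the column.
print(callable(df.count))     # True
print(df["count"].tolist())   # [10, 20, 30]
```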

Filtering rows

Plotly Express works on whatever you pass it, so filtering before plotting is a common pattern. Two common ways to filter are:

Boolean indexing

df_2007 = df[df["year"] == 2007]
df_rich = df[df["gdpPercap"] > 30000]
df_both = df[(df["year"] == 2007) & (df["gdpPercap"] > 30000)]

The & (and) and | (or) operators combine conditions. You must wrap each condition in parentheses, because & and | bind more tightly than comparisons such as == and >. It's a Python operator-precedence quirk that trips up everyone exactly once.
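
It can help to see that each condition is just a Series of True/False values. The sketch below, on made-up rows, names the masks first, which is one way to sidestep the parentheses issue; the variable names are illustrative.

```python
import pandas as pd

# Placeholder rows for illustration.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2002, 2007, 2002, 2007],
    "gdpPercap": [32000.0, 31000.0, 9000.0, 12000.0],
})

is_2007 = df["year"] == 2007       # boolean Series, one value per row
is_rich = df["gdpPercap"] > 30000

print(is_2007.tolist())            # [False, True, False, True]

df_both = df[is_2007 & is_rich]    # rows where both conditions hold
df_either = df[is_2007 | is_rich]  # rows where at least one holds
df_not_2007 = df[~is_2007]         # ~ negates a mask

print(len(df_both), len(df_either), len(df_not_2007))  # 1 3 2
```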

The .query() method

A more readable alternative for simple filters:

df.query("year == 2007")
df.query("year == 2007 and gdpPercap > 30000")

You will see both styles in the wild. Pick whichever reads better to you in context.
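
As a quick check that the two styles select the same rows, here is a sketch on made-up data. It also shows the @ prefix, which lets a query string refer to a Python variable; threshold is just an illustrative name.

```python
import pandas as pd

# Placeholder rows for illustration.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2002, 2007, 2002, 2007],
    "gdpPercap": [32000.0, 31000.0, 9000.0, 12000.0],
})

by_mask = df[(df["year"] == 2007) & (df["gdpPercap"] > 30000)]
by_query = df.query("year == 2007 and gdpPercap > 30000")

print(by_mask.equals(by_query))    # True

# Inside a query string, @name looks up a variable from the surrounding Python code.
threshold = 30000
by_var = df.query("gdpPercap > @threshold")
print(by_var["country"].tolist())  # ['A', 'A']
```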

Sorting

Sort by a column with sort_values:

df.sort_values("gdpPercap", ascending=False)

This is invaluable before charting. A bar chart sorted by value is almost always more readable than the same chart in alphabetical order.
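
A minimal sketch of sorting before charting, on four made-up rows. Note that sort_values returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Placeholder values for illustration.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdpPercap": [9000.0, 35000.0, 12000.0, 31000.0],
})

ranked = df.sort_values("gdpPercap", ascending=False)  # largest first
print(ranked["country"].tolist())  # ['B', 'D', 'C', 'A']
print(df["country"].tolist())      # ['A', 'B', 'C', 'D'] (df itself is unchanged)

top2 = ranked.head(2)              # sort, then head: a simple "top N"
print(top2)
```

Passing ranked to px.bar(ranked, x="country", y="gdpPercap") should then draw the bars in this ranked order, since Plotly Express generally keeps categories in the order they appear in the data.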

Notice how much more useful the bar chart is when sorted by value — the eye reads "ranking" instantly.

Group + aggregate (just a peek)

A common pattern is to summarize a DataFrame before plotting: "average GDP per continent," "total sales per region," etc. The pandas pattern is groupby + an aggregation:

avg_by_continent = (
    df.query("year == 2007")
      .groupby("continent")["lifeExp"]
      .mean()
      .reset_index()
)

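reset_index() is what turns the grouped result back into a normal DataFrame so Plotly Express can consume it: the groupby + mean step returns a Series indexed by continent, and reset_index() moves that index into an ordinary continent column.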
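
To see each step of that pattern, the sketch below runs it on five made-up rows (the life-expectancy numbers are placeholders) and prints the result before and after reset_index().

```python
import pandas as pd

# Placeholder rows for illustration; one 2002 row is included so the filter does something.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
    "year": [2007, 2007, 2007, 2007, 2002],
    "lifeExp": [70.0, 74.0, 78.0, 80.0, 60.0],
})

grouped = df.query("year == 2007").groupby("continent")["lifeExp"].mean()
print(type(grouped).__name__)          # Series, with continent as its index

avg_by_continent = grouped.reset_index()
print(list(avg_by_continent.columns))  # ['continent', 'lifeExp']
print(avg_by_continent)
```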

"Tidy" data: the secret to easy plotting

There's one big principle that will save you endless frustration: Plotly Express prefers tidy (long) data.

Tidy data follows three rules:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each cell is a single value.

A common non-tidy layout looks like:

city     2020  2021  2022
Boston    100   120   150
Tokyo     200   220   240

This is "wide" — years are columns. To plot it with Plotly Express, you almost always want to reshape it to long:

city     year  value
Boston   2020    100
Boston   2021    120
Boston   2022    150
Tokyo    2020    200
Tokyo    2021    220
Tokyo    2022    240

The tool to do that is df.melt(...):
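
Here is a sketch of that reshape applied to the Boston/Tokyo table above. The names wide_df and long_df are illustrative, and the final sort is there only so the rows appear in the same order as the long table shown earlier.

```python
import pandas as pd

# The wide table from above: one column per year.
wide_df = pd.DataFrame({
    "city": ["Boston", "Tokyo"],
    "2020": [100, 200],
    "2021": [120, 220],
    "2022": [150, 240],
})

# id_vars stays as-is; every other column becomes a (year, value) pair.
long_df = wide_df.melt(id_vars="city", var_name="year", value_name="value")

# melt stacks the columns one after another, so sort to group rows by city.
long_df = long_df.sort_values(["city", "year"]).reset_index(drop=True)
print(long_df)
```

With the data in this shape, a call such as px.line(long_df, x="year", y="value", color="city") can map each column to one chart encoding.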

This wide → long reshape pattern is one of the most common moves in real visualization work. Don't worry about memorizing melt() — when you need it, you'll look it up. Just recognize when your data is wide and Plotly Express is complaining.

Check your understanding

Question (select one)

What does df.head() do?

It removes the first row of the DataFrame.

It deletes the column headers.

It returns the first 5 rows of the DataFrame (or n rows if you pass df.head(n)).

It downloads the DataFrame.

Question (select one)

Which of the following filters a DataFrame df to keep only rows where year equals 2007?

df.filter(year == 2007)

df.where(year == 2007)

df[df["year"] == 2007] (or df.query("year == 2007"))

df.select("year == 2007")

Question (select one)

What does it mean for data to be in tidy (long) format?

The data has been spell-checked.

All numbers are integers.

Each variable is one column, each observation is one row, each cell holds a single value — making it easy to map columns to chart encodings.

The data has fewer than 100 rows.

Question (select one)

Why is sorting by value important before drawing a bar chart of categories?

Sorting is required by the matplotlib library.

Sorted data is faster to plot.

The eye reads a ranked / sorted bar chart almost instantly — alphabetical order obscures the actual ranking of values.
