Loading Data with Pandas

The minimum pandas you need to feed Plotly Express — DataFrames, columns, and basic filtering

Plotly Express expects its data in a pandas DataFrame. If you've never used pandas before, that sentence might sound intimidating — but a DataFrame is just a table, and you only need to know a handful of operations before you can make every chart in this course.

This page is the just-enough pandas tutorial. We will not turn you into a pandas expert; we will get you fluent enough to feed Plotly Express.

What is a DataFrame?

A DataFrame is a table in memory. It has:

Rows — usually one per observation (one country, one transaction, one student).
Columns — named, typed fields (e.g., country, year, gdpPercap).
An index — row labels, usually just 0, 1, 2, ... by default.

If you've used Excel, a DataFrame is roughly "one worksheet, with named column headers."
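A minimal sketch of those three parts, using a tiny hand-built table. The country names and numbers are invented for illustration only; they are not real data.

```python
import pandas as pd

# Hypothetical values, invented only to show the structure of a DataFrame.
df = pd.DataFrame(
    {
        "country": ["Alpha", "Beta", "Gamma"],
        "year": [2007, 2007, 2007],
        "gdpPercap": [1200.5, 30500.0, 8700.25],
    }
)

print(df)                 # the table itself
print(list(df.columns))   # the column names
print(len(df))            # the number of rows
print(list(df.index))     # the default row labels
```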

Displaying a DataFrame prints a nicely formatted table. df.columns lists the column names, and len(df) gives the number of rows.

Three ways data gets into a DataFrame

For this course, you'll see data arrive in three ways:

1. Built-in dataset. Plotly Express ships with several classic datasets:

import plotly.express as px

df = px.data.gapminder()    # life expectancy & GDP by country and year
df = px.data.iris()         # flower measurements
df = px.data.tips()         # restaurant tipping
df = px.data.stocks()       # tech-stock prices over time

2. Hand-built dictionary. Useful for tiny illustrative examples:
```
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
```
3. A CSV (or Excel) file. The real-world case:
```
df = pd.read_csv("sales.csv")
```
In your own work this is the most common pattern.

For this course, we'll mostly use options 1 and 2, because no file uploads are needed in the browser runtime.
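If you want to try the CSV route without uploading anything, one option is to hand read_csv an in-memory text buffer instead of a file path. The sketch below assumes a made-up two-row CSV; the column names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a real file such as sales.csv.
csv_text = """region,units,revenue
North,10,250.0
South,7,175.5
"""

# read_csv accepts a file-like object as well as a path.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # pandas infers a type for each column
```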

Peeking at a DataFrame

When you receive a DataFrame, your first instinct should be to look at it. Pandas provides a few essential methods:
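A short sketch of that first look, assuming a small invented table with gapminder-style column names (the values are placeholders, not real statistics):

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
df = pd.DataFrame(
    {
        "country": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta"],
        "year": [2007, 2007, 2007, 2002, 2002, 2002],
        "lifeExp": [70.1, 65.3, 80.2, 68.4, 63.0, 79.1],
    }
)

print(df.head())      # the first 5 rows
print(df.head(2))     # the first 2 rows
print(df.shape)       # a (rows, columns) tuple
print(df.columns)     # the column names
print(df.describe())  # summary statistics for the numeric columns
```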

Every analyst, every time they load a new dataset, runs some combination of df.head(), df.shape, df.columns, and df.describe(). Build the habit.

Selecting a column

To work with a single column, use square brackets:

df["lifeExp"]   # the lifeExp column, as a Series

You can also use dot notation if the column name is a valid Python identifier: df.lifeExp. Bracket notation works for every column name, including names that contain spaces or that clash with a DataFrame method, so prefer it.
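The sketch below compares the two notations. The column "gdp per cap" is a hypothetical name, chosen because it contains spaces and so cannot be reached with dot notation.

```python
import pandas as pd

# Hypothetical values; "gdp per cap" is deliberately not a valid identifier.
df = pd.DataFrame({"lifeExp": [70.1, 65.3], "gdp per cap": [1200.0, 900.0]})

life = df["lifeExp"]             # bracket notation returns a Series
same = df.lifeExp                # dot notation returns the same Series
gdp = df["gdp per cap"]          # only bracket notation can express this name
subset = df[["lifeExp", "gdp per cap"]]  # a list of names returns a DataFrame

print(type(life).__name__)
print(life.equals(same))
print(type(subset).__name__)
```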

Filtering rows

Plotly Express works on whatever you pass it, so filtering before plotting is a common pattern. There are two ways to filter:

Boolean indexing

df_2007 = df[df["year"] == 2007]
df_rich = df[df["gdpPercap"] > 30000]
df_both = df[(df["year"] == 2007) & (df["gdpPercap"] > 30000)]

The & (and) and | (or) operators combine conditions. You must wrap each condition in parentheses, because & and | bind more tightly than comparisons such as == and >. It is a Python operator-precedence quirk that trips up everyone exactly once.
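One way to sidestep the parentheses issue is to give each condition a name first. This is a sketch on invented rows; the mask names is_2007 and is_rich are just illustrative.

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
df = pd.DataFrame(
    {
        "country": ["Alpha", "Beta", "Gamma", "Delta"],
        "year": [2002, 2007, 2007, 2007],
        "gdpPercap": [35000.0, 41000.0, 900.0, 32000.0],
    }
)

# Each comparison produces a Series of True/False values, one per row.
is_2007 = df["year"] == 2007
is_rich = df["gdpPercap"] > 30000

both = df[is_2007 & is_rich]     # rows where both conditions hold
either = df[is_2007 | is_rich]   # rows where at least one holds
not_2007 = df[~is_2007]          # ~ negates a mask

print(both)
print(either)
print(not_2007)
```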

The `.query()` method

A more readable alternative for simple filters:

df.query("year == 2007")
df.query("year == 2007 and gdpPercap > 30000")

You will see both styles in the wild. Pick whichever reads better to you in context.
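To see that the two styles select the same rows, the sketch below runs both on a small invented table. It also shows the @ prefix, which lets a query string refer to a Python variable (target_year here is a hypothetical name).

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
df = pd.DataFrame(
    {
        "country": ["Alpha", "Beta", "Gamma", "Delta"],
        "year": [2002, 2007, 2007, 2007],
        "gdpPercap": [35000.0, 41000.0, 900.0, 32000.0],
    }
)

target_year = 2007

by_mask = df[(df["year"] == 2007) & (df["gdpPercap"] > 30000)]
by_query = df.query("year == 2007 and gdpPercap > 30000")
by_variable = df.query("year == @target_year and gdpPercap > 30000")

print(by_mask.equals(by_query))
print(by_query.equals(by_variable))
```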

Sorting

Sort by a column with sort_values:

df.sort_values("gdpPercap", ascending=False)

This is invaluable before charting. A bar chart sorted by value is almost always more readable than the same chart in alphabetical order.

Notice how much more useful the bar chart is when sorted by value — the eye reads "ranking" instantly.
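The sorting step on its own can be sketched as follows, again with invented values. The final comment shows how the sorted table might then be passed to a bar chart; that call is left as a comment so the example depends only on pandas.

```python
import pandas as pd

# Hypothetical values, invented for illustration.
df = pd.DataFrame(
    {
        "country": ["Alpha", "Beta", "Gamma", "Delta"],
        "gdpPercap": [35000.0, 900.0, 41000.0, 12000.0],
    }
)

# sort_values returns a new DataFrame; df keeps its original order.
ranked = df.sort_values("gdpPercap", ascending=False)

print(ranked)

# The sorted table could then be charted, for example:
# px.bar(ranked, x="country", y="gdpPercap")
```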

Group + aggregate (just a peek)

A common pattern is to summarize a DataFrame before plotting: "average GDP per continent," "total sales per region," etc. The pandas pattern is groupby + an aggregation:

avg_by_continent = (
    df.query("year == 2007")
      .groupby("continent")["lifeExp"]
      .mean()
      .reset_index()
)

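After groupby and mean, the result is a Series whose index holds the continent names. reset_index() moves those labels back into a regular column, turning the grouped result into a normal DataFrame that Plotly Express can consume.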
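To make the effect of reset_index() concrete, this sketch runs the same pattern on a few invented rows and keeps the intermediate result for comparison. The values are hypothetical.

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
df = pd.DataFrame(
    {
        "continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
        "year": [2007, 2007, 2007, 2007, 2002],
        "lifeExp": [70.0, 74.0, 78.0, 80.0, 60.0],
    }
)

# Before reset_index: a Series indexed by continent.
grouped = df.query("year == 2007").groupby("continent")["lifeExp"].mean()

# After reset_index: a DataFrame with continent and lifeExp columns.
avg_by_continent = grouped.reset_index()

print(grouped)
print(avg_by_continent)
```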

"Tidy" data: the secret to easy plotting

There's one big principle that will save you endless frustration: Plotly Express prefers tidy (long) data.

Tidy data follows three rules:

Each variable is a column.
Each observation is a row.
Each cell is a single value.

A common non-tidy layout looks like:

city	2020	2021	2022
Boston	100	120	150
Tokyo	200	220	240

This is "wide" — years are columns. To plot it with Plotly Express, you almost always want to reshape it to long:

city	year	value
Boston	2020	100
Boston	2021	120
Boston	2022	150
Tokyo	2020	200
Tokyo	2021	220
Tokyo	2022	240

The tool to do that is df.melt(...):
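A minimal sketch of that reshape, using the two-city table above. The names passed to var_name and value_name are choices made here to match the long table; the final sort is only there so the rows appear in the same order.

```python
import pandas as pd

# The wide table from above: one column per year.
wide = pd.DataFrame(
    {
        "city": ["Boston", "Tokyo"],
        "2020": [100, 200],
        "2021": [120, 220],
        "2022": [150, 240],
    }
)

# id_vars stay as they are; every other column becomes a (year, value) pair.
long = wide.melt(id_vars=["city"], var_name="year", value_name="value")

# The old column headers are strings, so convert them to integers.
long["year"] = long["year"].astype(int)

# Order the rows by city, then year, to match the long table above.
long = long.sort_values(["city", "year"]).reset_index(drop=True)

print(long)
```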

This wide → long reshape pattern is one of the most common moves in real visualization work. Don't worry about memorizing melt() — when you need it, you'll look it up. Just recognize when your data is wide and Plotly Express is complaining.

Check your understanding

Question (select one)

What does df.head() do?

It removes the first row of the DataFrame.

It deletes the column headers.

It returns the first 5 rows of the DataFrame (or n rows if you pass df.head(n)).

It downloads the DataFrame.

Question (select one)

Which of the following filters a DataFrame df to keep only rows where year equals 2007?

df.filter(year == 2007)

df.where(year == 2007)

df[df["year"] == 2007] (or df.query("year == 2007"))

df.select("year == 2007")

Question (select one)

What does it mean for data to be in tidy (long) format?

The data has been spell-checked.

All numbers are integers.

Each variable is one column, each observation is one row, each cell holds a single value — making it easy to map columns to chart encodings.

The data has fewer than 100 rows.

Question (select one)

Why is sorting by value important before drawing a bar chart of categories?

Sorting is required by the matplotlib library.

Sorted data is faster to plot.

The eye reads a ranked / sorted bar chart almost instantly — alphabetical order obscures the actual ranking of values.
