Loading Data with Pandas
The minimum pandas you need to feed Plotly Express — DataFrames, columns, and basic filtering
Plotly Express expects its data in a pandas DataFrame. If you've never used pandas before, that sentence might sound intimidating — but a DataFrame is just a table, and you only need to know a handful of operations before you can make every chart in this course.
This page is the just-enough pandas tutorial. We will not turn you into a pandas expert; we will get you fluent enough to feed Plotly Express.
What is a DataFrame?
A DataFrame is a table in memory. It has:
- Rows: usually one per observation (one country, one transaction, one student).
- Columns: named, typed fields (e.g., country, year, gdpPercap).
- An index: row labels, usually just 0, 1, 2, ... by default.
If you've used Excel, a DataFrame is roughly "one worksheet, with named column headers."
Printing a DataFrame (for example, with print(df.head())) shows a nicely formatted table. df.columns lists the column names, and len(df) gives the number of rows.
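As a minimal sketch of those three ideas, the snippet below builds a tiny table by hand and inspects it. The column names mirror the gapminder dataset, but the country labels and values are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row table; the numbers are illustrative, not real data.
df = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "year": [2007, 2007, 2007],
        "gdpPercap": [31000.0, 9000.0, 1500.0],
    }
)

print(df)  # the formatted table

column_names = list(df.columns)  # column labels, in order
row_count = len(df)              # number of rows
row_labels = list(df.index)      # default index: 0, 1, 2

print(column_names)
print(row_count)
print(row_labels)
```

The index is easy to overlook because pandas creates it automatically, but it is the left-most, unnamed column in the printed table.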
Three ways data gets into a DataFrame
For this course, you'll see data arrive in three ways:
1. Built-in dataset. Plotly Express ships with several classic datasets:

       df = px.data.gapminder()  # life expectancy & GDP
       df = px.data.iris()       # flower measurements
       df = px.data.tips()       # restaurant tipping
       df = px.data.stocks()     # tech stock prices, 2018-2019

2. Hand-built dictionary. Useful for tiny illustrative examples:

       df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

3. A CSV (or Excel) file. The real-world case:

       df = pd.read_csv("sales.csv")

   In your own work this is the most common pattern.
For this course, we'll mostly use options 1 and 2, because no file uploads are needed in the browser runtime.
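If you want to try the CSV route without a file on disk, one option is to hand pd.read_csv an in-memory text buffer. The sketch below assumes a made-up sales table; the column names and values are hypothetical.

```python
import io

import pandas as pd

# Stand-in for the contents of a file such as sales.csv (made-up data).
csv_text = """region,month,sales
North,2024-01,1200
South,2024-01,950
North,2024-02,1340
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # shows the type pandas inferred for each column

total_sales = df["sales"].sum()
print(total_sales)
```

With a real file, the only change would be passing the path, as in pd.read_csv("sales.csv").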
Peeking at a DataFrame
When you receive a DataFrame, your first instinct should be to look at it. Nearly every analyst, on loading a new dataset, runs some combination of df.head() (the first rows), df.shape (row and column counts), df.columns (the column names), and df.describe() (summary statistics for the numeric columns). Build the habit.
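The inspection routine can be sketched on a small hand-built table as follows. The data is hypothetical; with a real dataset the same four calls apply unchanged.

```python
import pandas as pd

# Hypothetical six-row table with gapminder-style column names.
df = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D", "E", "F"],
        "lifeExp": [70.0, 65.5, 80.2, 58.1, 74.3, 77.7],
        "gdpPercap": [9000.0, 4000.0, 35000.0, 1200.0, 15000.0, 28000.0],
    }
)

first_rows = df.head()   # first 5 rows by default
first_two = df.head(2)   # or pass the number of rows you want
shape = df.shape         # (rows, columns); an attribute, so no parentheses
summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column

print(first_rows)
print(shape)
print(df.columns.tolist())
print(summary)
```

Note that describe() skips the text column country by default and summarizes only the numeric columns.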
Selecting a column
To work with a single column, use square brackets:
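df["lifeExp"] # the lifeExp column, as a Series
You can also use dot notation if the column name is a valid Python identifier: df.lifeExp. Bracket notation is more universal: it works for any column name, including names that contain spaces or that clash with a DataFrame method. Use it.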
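A short sketch of why brackets are the safer habit, using a hypothetical two-row table. The column name pop is borrowed from gapminder, and it happens to collide with the DataFrame method of the same name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["A", "B"],
        "lifeExp": [70.0, 65.5],
        "pop": [1_000_000, 2_500_000],
    }
)

life = df["lifeExp"]                 # one name -> a Series
subset = df[["country", "lifeExp"]]  # a list of names -> a DataFrame

# df.pop is the DataFrame method, not the column, because the method
# name takes priority over the column name in attribute access.
dot_access_is_method = callable(df.pop)

# Brackets always return the column.
pop_column = df["pop"]

print(type(life).__name__, type(subset).__name__)
print(dot_access_is_method)
print(pop_column.tolist())
```

Passing a list inside the brackets (the double brackets above) is the usual way to keep several columns at once.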
Filtering rows
Plotly Express works on whatever you pass it, so filtering before plotting is a common pattern. There are two ways to filter:
Boolean indexing
df_2007 = df[df["year"] == 2007]
df_rich = df[df["gdpPercap"] > 30000]
df_both = df[(df["year"] == 2007) & (df["gdpPercap"] > 30000)]
The & (and) and | (or) operators combine conditions. You must wrap each condition in parentheses, because & and | bind more tightly than comparisons such as == and >. That operator-precedence quirk trips up everyone exactly once.
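One way to sidestep the parentheses issue is to give each condition a name first. The sketch below uses a hypothetical four-row table and also shows ~, the element-wise "not" operator.

```python
import pandas as pd

# Hypothetical data with gapminder-style column names.
df = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "year": [2002, 2007, 2007, 2007],
        "gdpPercap": [40000.0, 35000.0, 5000.0, 32000.0],
    }
)

# Each comparison yields a Boolean Series: one True/False per row.
is_2007 = df["year"] == 2007
is_rich = df["gdpPercap"] > 30000

both = df[is_2007 & is_rich]    # rows where both conditions hold
either = df[is_2007 | is_rich]  # rows where at least one holds
not_2007 = df[~is_2007]         # rows where the condition is False

print(both)
print(either)
print(not_2007)
```

Python's and, or, and not keywords do not work element-wise on a Series, which is why pandas relies on &, |, and ~ here.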
The .query() method
A more readable alternative for simple filters:
df.query("year == 2007")
df.query("year == 2007 and gdpPercap > 30000")
You will see both styles in the wild. Pick whichever reads better to you in context.
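When the threshold lives in a Python variable, .query() can refer to it with an @ prefix. The sketch below uses hypothetical data and variable names (target_year, min_gdp) and checks that both filtering styles select the same rows.

```python
import pandas as pd

# Hypothetical data with gapminder-style column names.
df = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "year": [2002, 2007, 2007, 2007],
        "gdpPercap": [40000.0, 35000.0, 5000.0, 32000.0],
    }
)

target_year = 2007
min_gdp = 30000

# @name pulls in a Python variable from the surrounding scope.
via_query = df.query("year == @target_year and gdpPercap > @min_gdp")

# The equivalent Boolean-indexing form.
via_mask = df[(df["year"] == target_year) & (df["gdpPercap"] > min_gdp)]

print(via_query)
print(via_query.equals(via_mask))
```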
Sorting
Sort by a column with sort_values:
df.sort_values("gdpPercap", ascending=False)
This is invaluable before charting. A bar chart sorted by value is almost always more readable than the same chart in alphabetical order.
Notice how much more useful the bar chart is when sorted by value — the eye reads "ranking" instantly.
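The difference between alphabetical and ranked order can be sketched on a hypothetical four-row table. Note that sort_values returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data, listed in alphabetical order of country.
df = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "gdpPercap": [13000.0, 5500.0, 49000.0, 2400.0],
    }
)

ranked = df.sort_values("gdpPercap", ascending=False)  # largest value first
top_two = ranked.head(2)                               # keep only the leaders

print(df["country"].tolist())      # original order is unchanged
print(ranked["country"].tolist())  # ranked order
print(top_two)
```

Passing ranked (rather than df) to a call such as px.bar(ranked, x="country", y="gdpPercap") should draw the bars in that ranked row order.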
Group + aggregate (just a peek)
A common pattern is to summarize a DataFrame before plotting:
"average GDP per continent," "total sales per region," etc. The
pandas pattern is groupby + an aggregation:
avg_by_continent = (
    df.query("year == 2007")
    .groupby("continent")["lifeExp"]
    .mean()
    .reset_index()
)
reset_index() is the step that turns the grouped result (a Series indexed by continent) back into a normal DataFrame, with continent as an ordinary column, so Plotly Express can consume it.
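To see what reset_index() changes, it can help to look at the result before and after. The sketch below uses hypothetical values, and its second half shows named aggregation, one possible way to compute several summaries at once.

```python
import pandas as pd

# Hypothetical data with gapminder-style column names.
df = pd.DataFrame(
    {
        "continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
        "year": [2007, 2007, 2007, 2007, 2007],
        "lifeExp": [70.0, 80.0, 78.0, 80.0, 82.0],
    }
)

# Before reset_index: a Series whose index holds the continent names.
grouped = df.groupby("continent")["lifeExp"].mean()

# After reset_index: a DataFrame with continent as an ordinary column.
tidy = grouped.reset_index()

# Named aggregation: output_column=(input_column, function).
stats = (
    df.groupby("continent")
    .agg(mean_lifeExp=("lifeExp", "mean"), n_rows=("lifeExp", "count"))
    .reset_index()
)

print(grouped)
print(tidy)
print(stats)
```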
"Tidy" data: the secret to easy plotting
There's one big principle that will save you endless frustration: Plotly Express prefers tidy (long) data.
Tidy data follows three rules:
- Each variable is a column.
- Each observation is a row.
- Each cell is a single value.
A common non-tidy layout looks like:
| city | 2020 | 2021 | 2022 |
|---|---|---|---|
| Boston | 100 | 120 | 150 |
| Tokyo | 200 | 220 | 240 |
This is "wide" — years are columns. To plot it with Plotly Express, you almost always want to reshape it to long:
| city | year | value |
|---|---|---|
| Boston | 2020 | 100 |
| Boston | 2021 | 120 |
| Boston | 2022 | 150 |
| Tokyo | 2020 | 200 |
| Tokyo | 2021 | 220 |
| Tokyo | 2022 | 240 |
The tool to do that is df.melt(...):
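A minimal sketch of that reshape, using the city table above. The names wide and long_df are chosen here for illustration; the year conversion and sort are optional tidying steps.

```python
import pandas as pd

# The wide table from above: one column per year.
wide = pd.DataFrame(
    {
        "city": ["Boston", "Tokyo"],
        "2020": [100, 200],
        "2021": [120, 220],
        "2022": [150, 240],
    }
)

# id_vars stays as-is; every other column is stacked into year/value pairs.
long_df = wide.melt(id_vars="city", var_name="year", value_name="value")

# The former column headers are text, so convert them to integers.
long_df["year"] = long_df["year"].astype(int)

# Optional: order rows by city, then year, to match the long table above.
long_df = long_df.sort_values(["city", "year"]).reset_index(drop=True)

print(long_df)
```

The long result maps directly onto chart arguments, for example px.line(long_df, x="year", y="value", color="city").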
This wide → long reshape pattern is one of the most common moves
in real visualization work. Don't worry about memorizing
melt() — when you need it, you'll look it up. Just recognize
when your data is wide and Plotly Express is complaining.
Check your understanding
What does df.head() do?
- It removes the first row of the DataFrame.
- It deletes the column headers.
- It returns the first 5 rows of the DataFrame (or n rows if you pass df.head(n)).
- It downloads the DataFrame.

Which of the following filters a DataFrame df to keep only rows where year equals 2007?
- df.filter(year == 2007)
- df.where(year == 2007)
- df[df["year"] == 2007] (or df.query("year == 2007"))
- df.select("year == 2007")

What does it mean for data to be in tidy (long) format?
- The data has been spell-checked.
- All numbers are integers.
- Each variable is one column, each observation is one row, each cell holds a single value, making it easy to map columns to chart encodings.
- The data has fewer than 100 rows.

Why is sorting by value important before drawing a bar chart of categories?
- Sorting is required by the matplotlib library.
- Sorted data is faster to plot.
- The eye reads a ranked / sorted bar chart almost instantly; alphabetical order obscures the actual ranking of values.