Selecting Data
The many ways to grab the rows and columns you want — and why Pandas has so many of them.
There are at least five distinct syntaxes in Pandas for "give me some of the data." That sounds like too many until you realize each one solves a slightly different problem.
This chapter is the overview; the next one (loc vs iloc) zooms in on the two most important.
The five access methods
| Syntax | Picks by | Returns | Use when |
|---|---|---|---|
| df["col"] | Column name | Series | Grab one column |
| df[["a","b"]] | Column names | DataFrame | Grab several columns |
| df.loc[...] | Labels | Scalar, Series, or DataFrame | Label-based row & column select |
| df.iloc[...] | Positions | Scalar, Series, or DataFrame | Position-based select |
| df[mask] | Boolean Series | DataFrame | Filter rows by a condition |
Let us see each in action on the same little DataFrame.
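The walkthrough needs a frame to work on. Here is a minimal sketch, assuming a hypothetical four-row table; the names, columns, and values are illustrative only and not taken from any real dataset.

```python
import pandas as pd

# Hypothetical example data: row labels are first names, not integers.
df = pd.DataFrame(
    {
        "age": [31, 45, 28, 39],
        "status": ["active", "inactive", "active", "active"],
        "revenue": [1200, 850, 430, 990],
    },
    index=["Aiko", "Bilal", "Chen", "Dara"],
)

print(df)
print(df.shape)  # (rows, columns)
```

Using string row labels makes it easier to tell label-based and position-based selection apart in the examples that follow.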
1. Pick one column → Series
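A small sketch of single-column selection, rebuilding the same hypothetical frame (illustrative values) so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical example data, for illustration only.
df = pd.DataFrame(
    {
        "age": [31, 45, 28, 39],
        "status": ["active", "inactive", "active", "active"],
        "revenue": [1200, 850, 430, 990],
    },
    index=["Aiko", "Bilal", "Chen", "Dara"],
)

# Single brackets with one column name give back a Series.
ages = df["age"]

print(type(ages).__name__)
print(ages)
```

The Series keeps the DataFrame's row index, so `ages["Chen"]` looks up a value by the same label.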
2. Pick several columns → DataFrame
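A sketch of multi-column selection on the same hypothetical frame (illustrative values). Note the double brackets: the outer pair is the selection, the inner pair is a Python list of names.

```python
import pandas as pd

# Hypothetical example data, for illustration only.
df = pd.DataFrame(
    {
        "age": [31, 45, 28, 39],
        "status": ["active", "inactive", "active", "active"],
        "revenue": [1200, 850, 430, 990],
    },
    index=["Aiko", "Bilal", "Chen", "Dara"],
)

# A list of column names gives back a DataFrame with those columns, in that order.
age_revenue = df[["age", "revenue"]]

# A one-element list still gives back a DataFrame, not a Series.
age_frame = df[["age"]]

print(type(age_revenue).__name__, age_revenue.shape)
print(type(age_frame).__name__, age_frame.shape)
```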
3. loc — by label
Important: loc slices include both endpoints, unlike Python list slices, which exclude the end.
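A sketch of label-based selection on the same hypothetical frame (illustrative values), showing a single row, a single cell, and an inclusive label slice:

```python
import pandas as pd

# Hypothetical example data, for illustration only.
df = pd.DataFrame(
    {
        "age": [31, 45, 28, 39],
        "status": ["active", "inactive", "active", "active"],
        "revenue": [1200, 850, 430, 990],
    },
    index=["Aiko", "Bilal", "Chen", "Dara"],
)

# One row label -> a Series holding that row.
aiko = df.loc["Aiko"]

# Row label plus column label -> a single value.
chen_age = df.loc["Chen", "age"]

# Label slices include BOTH endpoints, for rows and for columns.
block = df.loc["Aiko":"Chen", "age":"status"]

print(aiko)
print(chen_age)
print(block)
```

The slice returns three rows (Aiko, Bilal, Chen) and two columns (age, status), because the end labels are part of the result.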
4. iloc — by position
iloc follows normal Python slicing rules: end is exclusive.
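A sketch of position-based selection on the same hypothetical frame (illustrative values). Positions start at 0, and the end of a slice is left out.

```python
import pandas as pd

# Hypothetical example data, for illustration only.
df = pd.DataFrame(
    {
        "age": [31, 45, 28, 39],
        "status": ["active", "inactive", "active", "active"],
        "revenue": [1200, 850, 430, 990],
    },
    index=["Aiko", "Bilal", "Chen", "Dara"],
)

# Positions 0, 1, 2: the end position (3) is excluded.
first_three = df.iloc[0:3]

# Row position 2, column position 0 -> a single value.
cell = df.iloc[2, 0]

# Negative positions count from the end, as with Python lists.
last_row = df.iloc[-1]

# The same three rows, selected by label with an inclusive slice.
same_rows = first_three.equals(df.loc["Aiko":"Chen"])

print(first_three)
print(cell)
print(last_row.name)
print(same_rows)
```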
5. Boolean masks → filter rows
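A sketch of row filtering on the same hypothetical frame (illustrative values). The comparison builds a Series of True/False values, and indexing with it keeps the rows marked True.

```python
import pandas as pd

# Hypothetical example data, for illustration only.
df = pd.DataFrame(
    {
        "age": [31, 45, 28, 39],
        "status": ["active", "inactive", "active", "active"],
        "revenue": [1200, 850, 430, 990],
    },
    index=["Aiko", "Bilal", "Chen", "Dara"],
)

# Step 1: the condition produces one boolean per row.
mask = df["status"] == "active"

# Step 2: indexing with the mask keeps only the True rows.
active = df[mask]

print(mask)
print(active)
```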
This is the standard way to filter. We will spend more time on it in the Filtering Data chapter.
Why loc and iloc exist
You might wonder: why bother with two? The answer is that an integer is ambiguous when an index can hold any kind of label.
Imagine the index is [10, 20, 30]. What does df[1] mean? The row labeled 1 (which doesn't exist) or the row at position 1 (the row labeled 20)?
loc and iloc remove the ambiguity:
- df.loc[1] always means "the row with label 1."
- df.iloc[1] always means "the row at position 1."
Use them. On a DataFrame, plain df[1] is treated as a column lookup: it searches for a column named 1 and usually fails with a confusing KeyError.
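The ambiguity can be made concrete with a short sketch, assuming a hypothetical one-column frame whose index is [10, 20, 30] (the column name and values are illustrative):

```python
import pandas as pd

# Hypothetical frame with integer row labels that do not match positions.
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = df.loc[20]      # the row whose label is 20
by_position = df.iloc[1]   # the row at position 1, which is the same row

# There is no row labeled 1, so loc raises KeyError.
loc_failed = False
try:
    df.loc[1]
except KeyError:
    loc_failed = True

# Plain brackets look for a COLUMN named 1, which does not exist either.
plain_failed = False
try:
    df[1]
except KeyError:
    plain_failed = True

print(by_label["value"], by_position["value"])
print(loc_failed, plain_failed)
```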
Putting it together
A typical Pandas line you will see often:
df.loc[df["status"] == "active", ["customer_id", "revenue"]]
Parse it from outside in:
- Outer loc[...]: label-based selection.
- Row selector (df["status"] == "active"): a boolean mask.
- Column selector (["customer_id", "revenue"]): a list of column labels.
The whole expression reads as: "From df, the rows where status is active, keeping only the customer_id and revenue columns."
This is the canonical shape of an analyst's daily code.
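To see that shape run end to end, here is a minimal sketch on a hypothetical customer table; the column names follow the expression above, while the rows and the extra region column are made up for illustration.

```python
import pandas as pd

# Hypothetical customer data, for illustration only.
df = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "status": ["active", "inactive", "active", "active"],
        "revenue": [1200, 850, 430, 990],
        "region": ["east", "west", "east", "north"],
    }
)

# Rows: boolean mask. Columns: list of labels. Both passed to loc at once.
result = df.loc[df["status"] == "active", ["customer_id", "revenue"]]

print(result)
```

Filtering rows and picking columns in one loc call avoids chaining two separate selections, which keeps the intent readable on a single line.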
Quick exercise
Load:
https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv
Then produce three results:
- income_series: the MonthlyIncome column as a Series.
- age_income: a DataFrame with just Age and MonthlyIncome.
- first_ten: the first 10 rows of the original DataFrame using iloc.
Check your understanding
Which of these selects the single column age from df?
- df.age (attribute access; works but discouraged)
- df[["age"]] (returns a DataFrame)
- df["age"] (returns a Series)
- df.loc[:, "age":"age"] (a label slice)
In Pandas, what is the key difference between df.loc["Aiko":"Chen"] and df.iloc[0:3]?
- loc is faster
- They are identical
- loc slicing is inclusive of both endpoints, while iloc (like normal Python) is exclusive of the end
- iloc accepts strings
Reading df.loc[df["status"] == "active", ["id", "rev"]] out loud, what does it return?
- An error
- All rows of df
- The rows of df where status equals "active", keeping only the columns id and rev
- The first row