Dataslope

Selecting Data

The many ways to grab the rows and columns you want — and why Pandas has so many of them.

There are at least five distinct syntaxes in Pandas for "give me some of the data." That sounds like too many until you realize each one solves a slightly different problem.

This chapter is the overview; the next one (loc vs iloc) zooms in on the two most important.

The five access methods

Syntax | Picks by | Returns | Use when
--- | --- | --- | ---
df["col"] | Column name | Series | Grab one column
df[["a","b"]] | Column names | DataFrame | Grab several columns
df.loc[...] | Labels | Depends | Label-based row & column select
df.iloc[...] | Positions | Depends | Position-based select
df[mask] | Boolean Series | DataFrame | Filter rows by a condition

Let us see each in action on the same little DataFrame.
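
The chapter's original DataFrame is not reproduced here, so the sketch below builds a small hypothetical one. The names in the index and the age, city, and score columns are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample data: four people, labeled by name in the index.
df = pd.DataFrame(
    {
        "age": [29, 41, 35, 52],
        "city": ["Osaka", "Lagos", "Taipei", "Lima"],
        "score": [88.5, 92.0, 79.5, 95.0],
    },
    index=["Aiko", "Ben", "Chen", "Dara"],
)

print(df)
print(df.shape)  # (rows, columns)
```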

1. Pick one column → Series
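
A minimal sketch of single-column selection, assuming a small hypothetical DataFrame (the names and columns are illustrative, not the chapter's original data). Single brackets with one column name should give back a Series that keeps the row index.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"age": [29, 41, 35, 52], "score": [88.5, 92.0, 79.5, 95.0]},
    index=["Aiko", "Ben", "Chen", "Dara"],
)

ages = df["age"]       # single brackets, one name -> Series

print(type(ages))      # pandas Series
print(ages.name)       # the Series remembers its column name
print(ages["Ben"])     # values stay addressable by row label
```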

2. Pick several columns → DataFrame
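
A minimal sketch of multi-column selection on a hypothetical DataFrame (illustrative names and columns). The inner brackets are an ordinary Python list of column names, and a list selector returns a DataFrame even when it holds a single name.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "age": [29, 41, 35, 52],
        "city": ["Osaka", "Lagos", "Taipei", "Lima"],
        "score": [88.5, 92.0, 79.5, 95.0],
    },
    index=["Aiko", "Ben", "Chen", "Dara"],
)

subset = df[["age", "score"]]   # list of names -> DataFrame with two columns
one_col = df[["age"]]           # list with one name -> still a DataFrame

print(subset)
print(type(one_col), one_col.shape)
```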

3. loc — by label

Important: loc slices include both endpoints, unlike Python list slices, which exclude the end.
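
The inclusive-endpoint rule is easiest to see side by side with a plain list slice. The sketch below assumes a hypothetical DataFrame indexed by name; the data is illustrative only.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "age": [29, 41, 35, 52],
        "city": ["Osaka", "Lagos", "Taipei", "Lima"],
        "score": [88.5, 92.0, 79.5, 95.0],
    },
    index=["Aiko", "Ben", "Chen", "Dara"],
)

row = df.loc["Ben"]                           # one row label -> Series
cell = df.loc["Ben", "age"]                   # row label + column label -> scalar
rows = df.loc["Aiko":"Chen"]                  # label slice keeps Aiko, Ben and Chen
block = df.loc["Aiko":"Chen", "age":"city"]   # column slices are inclusive too

names = list(df.index)
list_slice = names[0:2]                       # plain list slice stops before index 2

print(list(rows.index))
print(list(block.columns))
print(list_slice)
```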

4. iloc — by position

iloc follows normal Python slicing rules: end is exclusive.
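
A short sketch of position-based selection on a hypothetical DataFrame (illustrative data). Positions start at 0, negative positions count from the end, and the slice end is excluded.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "age": [29, 41, 35, 52],
        "city": ["Osaka", "Lagos", "Taipei", "Lima"],
        "score": [88.5, 92.0, 79.5, 95.0],
    },
    index=["Aiko", "Ben", "Chen", "Dara"],
)

second_row = df.iloc[1]       # row at position 1 -> Series
first_three = df.iloc[0:3]    # positions 0, 1, 2 (position 3 is excluded)
corner = df.iloc[0:2, 0:2]    # first two rows, first two columns
last_row = df.iloc[-1]        # negative positions count from the end

print(list(first_three.index))
print(corner)
print(last_row.name)
```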

5. Boolean masks → filter rows
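
A minimal sketch of boolean filtering, assuming a hypothetical DataFrame with age and score columns. A comparison on a column yields a Series of True/False values, and indexing with it keeps only the True rows. Conditions are combined with & and | and need parentheses around each comparison.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"age": [29, 41, 35, 52], "score": [88.5, 92.0, 79.5, 95.0]},
    index=["Aiko", "Ben", "Chen", "Dara"],
)

mask = df["age"] > 30                              # boolean Series, one value per row
over_30 = df[mask]                                 # keep rows where mask is True
both = df[(df["age"] > 30) & (df["score"] > 90)]   # two conditions combined with &

print(mask.tolist())
print(list(over_30.index))
print(list(both.index))
```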

This is the standard way to filter. We will spend more time on it in the Filtering Data chapter.

Why loc and iloc exist

You might wonder: why bother with two? The answer is that an integer is ambiguous when an index can hold any kind of label.

Imagine the index is [10, 20, 30]. What does df[1] mean? The row labeled 1 (which doesn't exist) or the row at position 1 (which is the 20-labeled row)?

loc and iloc remove the ambiguity:

  • df.loc[1] always means "the row with label 1."
  • df.iloc[1] always means "the row at position 1."

Use them. On a DataFrame, plain df[1] is treated as a column lookup, so it raises a KeyError unless a column happens to be labeled 1.
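
The ambiguity can be checked directly on a frame whose index is [10, 20, 30]. This is a sketch on hypothetical data; raises_key_error is a small helper written for this example, not part of pandas.

```python
import pandas as pd

# Hypothetical frame whose row labels are the integers 10, 20, 30.
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_position = df.iloc[1]   # row at position 1, which is the row labeled 20
by_label = df.loc[20]      # row whose label is 20


def raises_key_error(lookup):
    """Hypothetical helper: return True if calling lookup() raises KeyError."""
    try:
        lookup()
    except KeyError:
        return True
    return False


label_1_missing = raises_key_error(lambda: df.loc[1])   # no row is labeled 1
plain_1_missing = raises_key_error(lambda: df[1])       # no column is labeled 1

print(by_position["value"], by_label["value"])
print(label_1_missing, plain_1_missing)
```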

Putting it together

A typical Pandas line you will see often:

df.loc[df["status"] == "active", ["customer_id", "revenue"]]

Parse it from outside in:

  1. Outer loc[...] — label-based selection.
  2. Row selector (df["status"] == "active") — a boolean mask.
  3. Column selector (["customer_id", "revenue"]) — list of column labels.

The whole expression reads as: "From df, the rows where status is active, keeping only the customer_id and revenue columns."
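
The three parts can also be written out one step at a time. The sketch below assumes a hypothetical customer table; only the status, customer_id, and revenue column names come from the expression above, and the rows and the region column are made up.

```python
import pandas as pd

# Hypothetical customer data.
df = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "status": ["active", "churned", "active", "paused"],
        "revenue": [250.0, 0.0, 410.0, 35.0],
        "region": ["east", "west", "west", "east"],
    }
)

is_active = df["status"] == "active"     # row selector: a boolean mask
columns = ["customer_id", "revenue"]     # column selector: a list of labels
result = df.loc[is_active, columns]      # loc combines the two selectors

# The same selection written as the single line from the text.
one_liner = df.loc[df["status"] == "active", ["customer_id", "revenue"]]

print(result)
```

Naming the mask first is a readability choice; both forms select the same rows and columns.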

This is the canonical shape of an analyst's daily code.

Quick exercise

Challenge: Practice selecting on the HR data

Load the CSV at this URL into a DataFrame:

https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv

Then produce three results:

  1. income_series — the MonthlyIncome column as a Series.
  2. age_income — a DataFrame with just Age and MonthlyIncome.
  3. first_ten — the first 10 rows of the original DataFrame using iloc.

Check your understanding

Question (select one)

Which of these is the standard way to select the single column age from df?

  • df.age (attribute access; works but discouraged)
  • df[["age"]] (returns a DataFrame)
  • df["age"] (returns a Series)
  • df.loc[:, "age":"age"] (a label slice)

Question (select one)

In Pandas, what is the key difference between df.loc["Aiko":"Chen"] and df.iloc[0:3]?

  • loc is faster
  • They are identical
  • loc slicing is inclusive of both endpoints, while iloc (like normal Python) is exclusive of the end
  • iloc accepts strings

Question (select one)

Reading df.loc[df["status"] == "active", ["id", "rev"]] out loud, what does it return?

  • An error
  • All rows of df
  • The rows of df where status equals "active", keeping only the columns id and rev
  • The first row
