Rows and Columns
The two-dimensional shape every dataset eventually wears, and why thinking in rows-as-records / columns-as-attributes will unlock the rest of the course.
Almost every dataset you will analyze in your life eventually takes the shape of a rectangle of rows and columns. This sentence sounds trivial. It is not. Internalizing the meaning of rows and columns is the single biggest conceptual unlock for Pandas.
The basic mental model
- A row is one "thing." One person, one transaction, one sensor reading, one survey response, one football play.
- A column is one "attribute" of those things. Age, price, temperature, satisfaction score, yards gained.
- The cell at row i, column j is the i-th thing's value of the j-th attribute.
When you read a Pandas tutorial and see df.shape == (1470, 35),
that means "1,470 things, each with 35 attributes." When you see
df.groupby("department"), it means "split the rows into groups
by their value in the department column, then do something with
each group."
That is essentially the whole course in two paragraphs.
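The two readings above can be tried on a toy table. This is a minimal sketch: the table, its column names, and its values are invented for illustration and are far smaller than the (1470, 35) example.

```python
import pandas as pd

# Hypothetical mini HR table: each row is one employee.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloe", "Dev"],
        "department": ["Sales", "Sales", "R&D", "R&D"],
        "salary": [52000, 58000, 71000, 69000],
    }
)

# shape is (rows, columns): 4 things, each with 3 attributes.
n_rows, n_cols = df.shape

# Split the rows by their value in the department column,
# then do something with each group (here: count its rows).
rows_per_department = df.groupby("department").size()

print(df.shape)
print(rows_per_department)
```

Counting rows is only one choice of "do something"; any per-group summary would fit the same pattern.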
Why "rows are things"
It is tempting to think of a row as just some numbers. But naming the thing each row represents (and naming it the same across every dataset in a project) is the difference between analyses that are easy to read and analyses that are mysterious.
In this course we will be very explicit about it:
| Dataset | Each row is one… |
|---|---|
| HR employees | Employee |
| Retail orders | Order (or order line) |
| Weather observations | Measurement at a location at a time |
| Public health records | Patient visit |
| Movies | Movie |
| Survey responses | Respondent |
| Transportation logs | Trip |
Naming the row clearly is also the first sanity check on a dataset. "Is each row really one customer?" Sometimes a CSV that looks customer-shaped is actually customer × month, with one row per customer per month. The shape of subsequent analysis depends on getting this right.
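The "is each row really one customer?" check can be sketched in a few lines. The table below is made up, and customer_id and month are assumed column names; the idea is simply that the columns you believe identify a row should never repeat.

```python
import pandas as pd

# Hypothetical data: looks customer-shaped, but is really customer x month.
customers = pd.DataFrame(
    {
        "customer_id": [101, 101, 102, 102, 103],
        "month": ["Jan", "Feb", "Jan", "Feb", "Jan"],
        "spend": [40.0, 55.0, 20.0, 0.0, 75.0],
    }
)

# If each row were one customer, customer_id would never repeat.
one_row_per_customer = bool(customers["customer_id"].is_unique)

# The pair (customer_id, month) is what actually identifies a row here.
one_row_per_customer_month = not customers.duplicated(
    subset=["customer_id", "month"]
).any()

print(one_row_per_customer)
print(one_row_per_customer_month)
```

If the first check fails and the second passes, the honest name for the row is "customer-month", not "customer".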
The grain of the data
The size of the "thing" a row represents is called the grain of the dataset. Customer-month grain is finer than customer grain, because each customer is spread across several rows. Many data bugs come from joining two datasets without noticing that they are at different grains.
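A small sketch of how a grain mismatch goes wrong in a join. Both tables and the signup_bonus column are hypothetical; the point is that the joined table takes the finer grain, so customer-level values get repeated once per month.

```python
import pandas as pd

# Customer grain: one row per customer (invented data).
customers = pd.DataFrame({"customer_id": [1, 2], "signup_bonus": [10, 10]})

# Customer-month grain: one row per customer per month.
monthly = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "month": ["Jan", "Feb", "Mar", "Jan", "Feb"],
        "spend": [30, 45, 25, 60, 15],
    }
)

# validate="many_to_one" makes pandas raise an error if customer_id
# is not unique on the right-hand (customer-grain) side.
joined = monthly.merge(customers, on="customer_id", validate="many_to_one")

# Summing a customer-level column on the joined table counts each
# customer's bonus once per month, which overstates the total.
true_bonus_total = customers["signup_bonus"].sum()
inflated_bonus_total = joined["signup_bonus"].sum()

print(len(joined), true_bonus_total, inflated_bonus_total)
```

The join itself is legitimate; the bug appears only when a customer-grain column is aggregated as if the joined table still had one row per customer.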
Why "columns are attributes"
Columns share three properties that you will rely on constantly:
- A column has a name. That name is the human-facing label for the attribute.
- A column has a type. All its values are (in principle) the same kind of thing — integers, strings, dates.
- A column is the unit of most Pandas operations. You filter rows based on a column. You group rows by a column. You sum, average, or count a column.
The Pandas operation df["salary"].mean() reads naturally as
"the mean of the salary column" because the column is the
operand.
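The three properties can be seen directly on a small table. This is an illustrative sketch with invented names and values, not a dataset from the course.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloe"],
        "salary": [52000, 58000, 71000],
        "hired": pd.to_datetime(["2019-03-01", "2021-07-15", "2020-11-30"]),
    }
)

# 1. Every column has a name.
column_names = list(df.columns)

# 2. Every column has a type, shared by all of its values.
print(df.dtypes)

# 3. Operations take a column as their operand.
mean_salary = df["salary"].mean()        # average a column
well_paid = df[df["salary"] > 55000]     # filter rows based on a column

print(column_names)
print(round(mean_salary, 2))
print(well_paid["name"].tolist())
```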
Putting it together
Let us make a small dataset by hand to feel the rows-and-columns geometry.
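One way to build such a dataset is from a dictionary that maps each column name to its list of values. The employees table below is an assumption made for illustration; every name and number in it is invented.

```python
import pandas as pd

# Invented values: each row is one employee, each column one attribute.
employees = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloe", "Dev", "Elif"],
        "department": ["Sales", "Sales", "R&D", "R&D", "Support"],
        "age": [34, 45, 29, 41, 38],
        "salary": [52000, 58000, 71000, 69000, 47000],
    }
)

print(employees)
print(employees.shape)  # 5 things, each with 4 attributes
```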
Now look at how naturally the basic operations read:
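A sketch of three such operations on a hand-made table. The employees table is hypothetical and is rebuilt here so the snippet runs on its own; each comment gives the English question that the line answers.

```python
import pandas as pd

# Invented values: each row is one employee.
employees = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloe", "Dev", "Elif"],
        "department": ["Sales", "Sales", "R&D", "R&D", "Support"],
        "age": [34, 45, 29, 41, 38],
        "salary": [52000, 58000, 71000, 69000, 47000],
    }
)

# "Rows where the department is R&D"
rd_rows = employees[employees["department"] == "R&D"]

# "Average of salary, over the rows where age is above 35"
avg_salary_over_35 = employees.loc[employees["age"] > 35, "salary"].mean()

# "Average of salary for each department"
avg_by_department = employees.groupby("department")["salary"].mean()

print(rd_rows["name"].tolist())
print(avg_salary_over_35)
print(avg_by_department)
```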
Reading the code out loud as "rows where ... average of ..." matches the English question almost word for word.
The shape of "long" vs "wide"
The same logical information can be arranged in two very different rectangle shapes. Consider monthly sales for three products:
Long form (also called tidy):
| product | month | sales |
|---|---|---|
| Latte | Jan | 1200 |
| Latte | Feb | 1100 |
| Latte | Mar | 1500 |
| Mocha | Jan | 800 |
| Mocha | Feb | 950 |
| Mocha | Mar | 900 |
| Espresso | Jan | 400 |
| Espresso | Feb | 450 |
| Espresso | Mar | 500 |

Wide form (more spreadsheet-like):
| product | Jan | Feb | Mar |
|---|---|---|---|
| Latte | 1200 | 1100 | 1500 |
| Mocha | 800 | 950 | 900 |
| Espresso | 400 | 450 | 500 |

Both contain the same information. But the row in long form is one (product, month) measurement; in wide form it is one product, across many months. The grain has changed. We will spend a whole chapter (Wide vs Long) on when each shape is better and how to convert between them.
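As a preview, the two layouts can be converted into each other with pivot and melt. This is a minimal sketch using the sales figures above; the Wide vs Long chapter may approach the conversion differently.

```python
import pandas as pd

# Long form: one row per (product, month) measurement.
long_df = pd.DataFrame(
    {
        "product": ["Latte"] * 3 + ["Mocha"] * 3 + ["Espresso"] * 3,
        "month": ["Jan", "Feb", "Mar"] * 3,
        "sales": [1200, 1100, 1500, 800, 950, 900, 400, 450, 500],
    }
)

# Long -> wide: one row per product, one column per month.
wide_df = (
    long_df.pivot(index="product", columns="month", values="sales")
    .loc[:, ["Jan", "Feb", "Mar"]]  # pivot sorts the month columns alphabetically
    .reset_index()
)

# Wide -> long: back to one row per (product, month).
back_to_long = wide_df.melt(
    id_vars=["product"], var_name="month", value_name="sales"
)

print(long_df.shape)       # (9, 3)
print(wide_df.shape)       # (3, 4)
print(back_to_long.shape)  # (9, 3)
```

The row count changes from 9 to 3 and back, which is the change of grain showing up in shape.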
Prefer long when manipulating, wide when displaying
Pandas operations (groupby, filter, plot) are much easier on long-form data because each row is a single observation. Humans, however, often find wide form easier to read. A common workflow is: keep your working copy long, pivot to wide only for display.
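The workflow can be sketched as follows, using the product sales figures from this section. The variable names are illustrative only.

```python
import pandas as pd

# Working copy stays long: one row per (product, month) observation.
long_df = pd.DataFrame(
    {
        "product": ["Latte"] * 3 + ["Mocha"] * 3 + ["Espresso"] * 3,
        "month": ["Jan", "Feb", "Mar"] * 3,
        "sales": [1200, 1100, 1500, 800, 950, 900, 400, 450, 500],
    }
)

# Filtering and grouping on long data only ever name a column.
without_feb = long_df[long_df["month"] != "Feb"]
total_by_product = without_feb.groupby("product")["sales"].sum()

# Pivot to wide at the very end, purely for a human-readable table.
display_table = long_df.pivot(index="product", columns="month", values="sales")
display_table = display_table[["Jan", "Feb", "Mar"]]  # calendar order

print(total_by_product)
print(display_table)
```

On the wide table, the same "drop February, then total" question would mean dropping a column and summing across the remaining ones, which is why the working copy is kept long.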
Visualizing a Pandas slice
When you select a column, you get back a Series (a labeled 1-D array). When you select a row, you also get a Series, but indexed by the column names. This duality — same underlying type, different axis — is a recurring theme.
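A quick way to see the duality, using a small invented table: select one column and one row, then compare the type and the index of each result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Chloe"],
        "age": [34, 45, 29],
        "salary": [52000, 58000, 71000],
    }
)

# Selecting a column: a Series indexed by the row labels (0, 1, 2).
age_column = df["age"]

# Selecting a row by position: a Series indexed by the column names.
first_row = df.iloc[0]

print(type(age_column).__name__, list(age_column.index))
print(type(first_row).__name__, list(first_row.index))
```

One practical difference: a column holds values of a single type, while a row usually mixes types, so the row Series here falls back to the generic object dtype.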
We will study Series and DataFrames in depth in the DataFrames and Series chapter. For now, just know that they share a lot of machinery.
A small challenge
Try the challenge below to cement the rows-as-records, columns-as-attributes idea.
Construct a DataFrame called students with exactly 5 rows and 4 columns. Each row should represent one student. The columns should be:
- name: a string
- grade: an integer between 1 and 12
- gpa: a float between 0.0 and 4.0
- passed_math: a boolean
You can pick any names, grades, GPAs, and pass/fail values.
Check your understanding
In a DataFrame of retail orders where each row represents one order line, which of these is most appropriate as a column?
- The store address
- A pie chart of the orders
- The quantity of items in that order line
- A spreadsheet formula

What does the term grain mean in the context of a tabular dataset?
- The font used to display the data
- The total size of the dataset in bytes
- The level of detail each row represents, for example "one row per customer" versus "one row per customer-month"
- The color scheme of the table

Long form versus wide form: which one is generally easier for Pandas to filter, group, and plot?
- Wide
- They are equivalent
- Long
- Neither