Data Frames
R's spreadsheet-on-steroids. A data frame is just a collection of equal-length vectors, but that simple idea is enough to organize most of the data you'll ever work with.
A real dataset rarely lives as a single vector. It lives as a table: rows for observations, columns for variables. In R, that table is called a data frame — and it's the central data structure for nearly all statistical and data work.
A data frame is, fundamentally, just a list of equal-length vectors, displayed as a table. Each column is a vector. All columns must have the same length. That's it.
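The "list of equal-length vectors" claim can be checked directly. This is a minimal sketch; the data frame df and its columns x and y are hypothetical names chosen for illustration.

```r
# Hypothetical two-column example; the names df, x, and y are illustrative.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

is.list(df)         # TRUE: a data frame is a list under the hood
typeof(df)          # "list"
length(df)          # 2: one list element per column
sapply(df, length)  # every column has the same length

# Columns whose lengths cannot be reconciled are rejected
# (3 is not a multiple of 2, so no recycling is possible).
bad <- try(data.frame(x = 1:3, y = c("a", "b")), silent = TRUE)
inherits(bad, "try-error")  # TRUE
```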
Creating a data frame from scratch
You can build one with data.frame():
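A minimal sketch of such a data frame, using the column names mentioned in the text (name, dept, salary, remote). The values are invented for illustration.

```r
# Column names follow the text; every value below is made up.
employees <- data.frame(
  name   = c("Ana", "Bo", "Cy", "Di"),
  dept   = c("Sales", "Eng", "Eng", "Ops"),
  salary = c(52000, 78000, 81000, 60000),
  remote = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE  # the default since R 4.0, stated explicitly here
)

employees                 # prints the four columns side by side
sapply(employees, class)  # one type per column
```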
When you print a data frame in WebR, you get a real rendered
HTML table. Each column has its own type — name and dept are
character, salary is numeric, remote is logical.
The columns are vectors. The table just displays them side-by-side.
Inspecting a data frame
R provides a small toolkit of inspection functions. You will use these at the start of every real analysis:
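A sketch of that toolkit on a small, made-up employees data frame. Only str() and summary() are named in the text; the other base R functions shown here are an assumption about what the toolkit includes.

```r
# Illustrative data; the values are invented.
employees <- data.frame(
  name   = c("Ana", "Bo", "Cy", "Di"),
  dept   = c("Sales", "Eng", "Eng", "Ops"),
  salary = c(52000, 78000, 81000, 60000),
  remote = c(TRUE, FALSE, TRUE, FALSE)
)

head(employees, 2)  # first two rows
tail(employees, 2)  # last two rows
dim(employees)      # number of rows, then number of columns
nrow(employees)     # number of rows
ncol(employees)     # number of columns
names(employees)    # column names
str(employees)      # compact structure: size, column types, first values
summary(employees)  # per-column summaries
```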
Of these, str() and summary() are the two you will use the most. str() shows the structure of the data (dimensions, column types, and the first few values); summary() shows a per-column summary of the content.
Accessing columns
A data frame is a list of columns. You access a column with $
or with [["..."]]:
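A short sketch of both access forms, again on an invented employees data frame.

```r
# Illustrative data; the values are invented.
employees <- data.frame(
  name   = c("Ana", "Bo", "Cy", "Di"),
  salary = c(52000, 78000, 81000, 60000),
  remote = c(TRUE, FALSE, TRUE, FALSE)
)

employees$salary        # the salary column as a plain numeric vector
employees[["salary"]]   # the same vector, selected by name as a string

identical(employees$salary, employees[["salary"]])  # TRUE

# The result is an ordinary vector, so vector tools apply.
mean(employees$salary)               # 67750
employees$salary[employees$remote]   # salaries of remote employees only
```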
Once you've grabbed a column, you're back in vector land — every trick from the last four pages works.
Indexing rows and columns: [row, col]
A data frame can also be indexed with two-dimensional [ ]
notation. The convention is [rows, columns]. Leave a slot empty
to mean "all":
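A sketch of the [rows, columns] convention on an invented employees data frame. The 70000 threshold is arbitrary.

```r
# Illustrative data; the values are invented.
employees <- data.frame(
  name   = c("Ana", "Bo", "Cy", "Di"),
  dept   = c("Sales", "Eng", "Eng", "Ops"),
  salary = c(52000, 78000, 81000, 60000),
  remote = c(TRUE, FALSE, TRUE, FALSE)
)

employees[1, ]                         # first row, all columns
employees[, "salary"]                  # all rows, one column (a vector)
employees[1:2, c("name", "salary")]    # rows 1 and 2, two named columns
employees[2, 3]                        # one cell: row 2, column 3
employees[employees$salary > 70000, ]  # rows where salary exceeds 70000
```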
That last line is the classic data-frame filter: use a logical vector built from one of the columns to select rows.
Adding and modifying columns
To add a new column, assign a vector to a new column name with $; assigning to an existing name overwrites that column:
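A minimal sketch on an invented employees data frame. The bonus column and the 10% rate follow the text; the 3% raise is a hypothetical example of modifying a column.

```r
# Illustrative data; the values are invented.
employees <- data.frame(
  name   = c("Ana", "Bo", "Cy", "Di"),
  salary = c(52000, 78000, 81000, 60000)
)

# Add a column: the arithmetic is vectorized, giving one bonus per row.
employees$bonus <- employees$salary * 0.10

# Modify a column: assign to a name that already exists (hypothetical raise).
employees$salary <- employees$salary * 1.03

names(employees)  # "name" "salary" "bonus"
employees
```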
Vectorized arithmetic works exactly as you'd hope —
employees$salary * 0.10 produces a new vector of bonuses, which
is then stored as a new column.
Built-in datasets: your sandbox
R comes with dozens of built-in datasets you can experiment with freely. A few favorites we'll use throughout the course:
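A sketch using three datasets that ship with base R. mtcars appears later on this page; iris and airquality are assumed here as plausible examples rather than the course's confirmed list.

```r
# All three datasets are available in a fresh R session without loading anything.
head(mtcars)         # specs and fuel economy for 32 cars
dim(iris)            # 150 rows, 5 columns
names(iris)          # dotted column names such as Sepal.Length
summary(airquality)  # the summary reports NA counts for some columns
anyNA(airquality)    # TRUE: this dataset contains missing values
```

Calling data() with no arguments in an interactive session lists every dataset available.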
Real datasets, even classics like these, have quirks: missing values, weird scales, oddly-named columns. Half the joy of EDA is finding them.
Test your understanding
Fundamentally, what is a data frame in R?
- A single multi-dimensional array
- An Excel file
- A list of equal-length vectors (one per column), displayed as a table
- A SQL table
Which expression returns the mpg column of mtcars as a vector?
- mtcars[mpg]
- mtcars$mpg
- mtcars.mpg
- mtcars(mpg)
What does mtcars[mtcars$mpg > 25, ] return?
- The single value of mpg greater than 25
- The column mpg filtered
- All rows of mtcars where mpg > 25, with all columns kept
- An error
Mini challenge: build and summarize a small data frame
Build a data frame students with columns name (character), grade (integer), and score (numeric), then compute avg_score, the mean score across all students.
Create a data frame called students with these three rows:
- "Ada", grade 10, score 92
- "Ben", grade 11, score 78
- "Cleo", grade 10, score 85
Then assign avg_score to the mean of the score column.
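One possible solution, offered as a sketch rather than the only accepted answer. The L suffix is used so that grade is stored as integer, as the column description asks; plain 10 and 11 would be stored as doubles.

```r
students <- data.frame(
  name  = c("Ada", "Ben", "Cleo"),
  grade = c(10L, 11L, 10L),  # L suffix makes these integers
  score = c(92, 78, 85)
)

avg_score <- mean(students$score)
avg_score  # 85
```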
Now that we can create and inspect data frames, the next page focuses on the very first thing a real data analyst does with a new dataset: look at it.