Inspecting a Dataset
Before you analyze a dataset, you have to *meet* it. This page walks through the five-minute ritual every analyst performs the moment a new dataset lands on their desk.
When a new dataset arrives, the temptation is to dive straight into the interesting question — "do customers in region X spend more than customers in region Y?" — and start running analyses.
That is almost always a mistake. The first thing an experienced analyst does is more modest: find out what is actually there. How big is it? What are the columns? Are there missing values? Are the numbers in sensible ranges? Are there obviously wrong rows?
This page is the ritual. Memorize it.
The five-minute ritual
Each step takes seconds and gives you orientation. Skip any of them and you risk asking a meaningless question of a misunderstood dataset.
We'll use airquality — a built-in dataset of daily air quality
measurements in New York, May–September 1973 — as our example.
Step 1: shape
The shape tells you what kind of analysis is even feasible. 153 rows × 6 columns is small — every operation will be instant. A million rows × 200 columns is a different problem entirely.
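A minimal sketch of the shape check, using base R's dim(), nrow(), and ncol() on the built-in airquality data frame (the variable names are just for illustration):

```r
# airquality ships with base R (package 'datasets'), so nothing needs loading.
shape <- dim(airquality)   # integer vector: c(rows, columns)
print(shape)               # 153 rows, 6 columns

# The same information, one number at a time.
n_rows <- nrow(airquality)
n_cols <- ncol(airquality)
cat("rows:", n_rows, "columns:", n_cols, "\n")
```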
Step 2: peek
Glance at the first few and last few rows. You're checking:
- Did the data load correctly? (No mangled headers, no obvious parsing errors.)
- Do the values look like what you'd expect?
- Are there obvious anomalies?
It is astonishing how often a 10-second head() catches a problem
that would have wasted an hour downstream.
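One way to do the peek, sketched with base R's head(), tail(), and names(); the row counts chosen here are arbitrary:

```r
# First and last rows of the built-in airquality data frame.
first_rows <- head(airquality)          # first 6 rows by default
last_rows  <- tail(airquality, n = 3)   # last 3 rows

print(first_rows)
print(last_rows)

# Column names deserve a separate look: mangled headers show up here.
col_names <- names(airquality)
print(col_names)
```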
Step 3: structure and types
str() packs an enormous amount of information into a few lines:
how many rows, how many columns, every column name, every column
type, and a sample of values.
What you're looking for:
- Are types right? (Is the "amount" column truly numeric, or did it accidentally get loaded as character because of a stray comma?)
- Are dates being treated as dates or as strings?
- Are categorical variables stored as factors, characters, or numbers?
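The checks above might look like this in practice. The first half inspects airquality; the second half is a made-up vector (not from any real dataset) illustrating how a single stray comma turns a numeric column into text:

```r
# Compact overview: dimensions, column names, types, and sample values.
str(airquality)

# One class per column, as a named character vector.
col_types <- sapply(airquality, class)
print(col_types)

# Hypothetical "amount" values: the "1,200" entry forces character storage.
amount <- c("950", "1,200", "87")
print(is.numeric(amount))   # FALSE

# Strip the commas, then convert to numeric.
amount_fixed <- as.numeric(gsub(",", "", amount))
print(amount_fixed)
```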
Step 4: summaries
summary() gives you, per column:
- For numerics: min, 1st quartile, median, mean, 3rd quartile, max, and the count of NAs (shown only when the column has any)
- For factors: counts per level
- For characters: just the length, class, and mode, with no counts of individual values
You're scanning for:
- Implausible values. Negative ages. Heights of 3 meters. Temperatures of 999 (a common "missing" sentinel).
- Suspiciously round means. A mean of exactly 0 in a column that should be variable is suspicious.
- Mean very different from median. Suggests skew or outliers.
- Lots of NAs.
In airquality, look at the Ozone column. The max is 168 but
the median is around 30 — a heavy right tail. That's a real
finding, not a glitch, but worth noting.
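A short sketch of the summary step and the Ozone comparison described above. Ozone contains missing values, so the individual statistics need na.rm = TRUE:

```r
# Per-column summaries for the whole data frame.
print(summary(airquality))

# Mean versus median for Ozone; na.rm = TRUE drops missing values first.
ozone_mean   <- mean(airquality$Ozone, na.rm = TRUE)
ozone_median <- median(airquality$Ozone, na.rm = TRUE)
ozone_max    <- max(airquality$Ozone, na.rm = TRUE)

cat("mean:", round(ozone_mean, 1),
    "median:", ozone_median,
    "max:", ozone_max, "\n")
```

A mean well above the median, with a maximum far beyond both, is the signature of the right tail mentioned above.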
Step 5: missingness
The default summary() does report NAs, but it's often
clearer to count them explicitly, for example with
colSums(is.na(airquality)).
In airquality, Ozone is missing 37 of its 153 values and Solar.R
is missing 7. That affects how you compute averages, how you
plot, and how you join with other data. Knowing it
upfront lets you choose your strategy.
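A minimal sketch of an explicit missingness check in base R, followed by its effect on a simple average:

```r
# Number of missing values in each column.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Share of missing values per column, as a percentage.
na_share <- round(100 * colMeans(is.na(airquality)), 1)
print(na_share)

# Rows with no missing value in any column.
n_complete <- sum(complete.cases(airquality))
cat("complete rows:", n_complete, "of", nrow(airquality), "\n")

# mean() propagates NA unless told to drop missing values.
raw_mean   <- mean(airquality$Ozone)                 # NA
clean_mean <- mean(airquality$Ozone, na.rm = TRUE)
print(c(raw = raw_mean, clean = clean_mean))
```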
A multi-file walkthrough: meet iris
Let's apply the ritual to a fresh dataset, with the inspection split across files so you can see each step clearly:
Use the inspection ritual on iris. The starter code already loads the data, and there's a helper file with a "report" function. Complete main.R so the script prints shape, head, structure, and a summary. The test just checks that nothing errors.
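As a rough idea of what such a helper could look like, here is a sketch that bundles the ritual into one function. The name inspect_report and its arguments are hypothetical; the course's own "report" function may be organized differently.

```r
# Hypothetical helper, not the course's actual "report" function.
inspect_report <- function(df, n = 6) {
  cat("Shape:", nrow(df), "rows x", ncol(df), "columns\n\n")

  cat("First rows:\n")
  print(head(df, n))

  cat("\nStructure:\n")
  str(df)

  cat("\nSummary:\n")
  print(summary(df))

  cat("\nMissing values per column:\n")
  print(colSums(is.na(df)))

  invisible(df)   # return the input invisibly so calls can be chained
}

# iris is built into R, like airquality.
result <- inspect_report(iris)
```

Returning the data frame invisibly is a design choice for this sketch: the function is called for its printed output, but the result can still be assigned or piped onward.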
Test your understanding
Which function gives you a compact view of a data frame's structure: number of rows, number of columns, every column's type, and a sample of its values?
- summary()
- head()
- str()
- dim()
Why is it a bad idea to skip the "inspect the dataset" ritual?
- It's required by R; the language won't compute without it.
- Because real datasets contain surprises (missing values, wrong types, implausible values) that can silently corrupt every downstream analysis.
- Because summary() is the only way to compute means.
- Because it's a tradition.
If mean(airquality$Ozone) returns NA, the most likely reason is:
- The mean function is broken.
- The dataset has fewer than 30 rows.
- The Ozone column contains NA values, and mean() returns NA unless told otherwise.
- R refuses to compute means on built-in datasets.
Inspecting tells us what's there. The next page is about extracting just the parts we want.