Mini Project Walkthrough
A complete end-to-end analysis of a real built-in dataset — load, inspect, tidy, transform, summarize, visualize, interpret. This is the workflow every analysis follows.
You've now met every piece of the puzzle: vectors, data frames, dplyr verbs, ggplot, summary statistics, and inference. This page strings them all together on a real dataset so you can see the whole arc of an analysis at once.
We'll use the built-in airquality dataset: daily air quality measurements from New York during May–September 1973.
Our question:
Is ozone higher on hotter days, and how does it vary across the summer months?
That single question will pull us through the entire workflow.
Step 1 — Load and inspect
Look before you leap. Always.
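A minimal sketch of that first look, using only base R. The helper name na_counts is ours, introduced for illustration.

```r
# airquality ships with base R (package "datasets"), so nothing needs installing.
data(airquality)

dim(airquality)      # number of rows and columns
head(airquality)     # first six rows
str(airquality)      # column names and types
summary(airquality)  # per-column ranges, including NA counts

# Count missing values per column explicitly (na_counts is an illustrative name)
na_counts <- colSums(is.na(airquality))
na_counts
```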
What we learn:
- 153 rows, 6 columns: Ozone, Solar.R, Wind, Temp, Month, Day.
- Ozone and Solar.R have NAs, missing values we'll need to handle.
- Month is numeric (5 through 9) but conceptually categorical.
- Temp is in degrees Fahrenheit.
Step 2 — Tidy: handle NAs and fix types
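The tidying step might look like the following sketch. It mirrors the condensed script later on this page; the closing glimpse() call is our addition, there only to confirm the new columns.

```r
library(dplyr)

aq <- airquality |>
  filter(!is.na(Ozone)) |>               # keep only days with an Ozone reading
  mutate(
    # Ordered month labels instead of the bare numbers 5 through 9
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep")),
    # Fahrenheit to Celsius
    Temp_C = (Temp - 32) * 5 / 9
  )

glimpse(aq)
```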
A few notes:
- We dropped rows missing Ozone because Ozone is the variable we want to analyze. (Solar.R NAs we'll handle later if needed.) Dropping rows is fine when you've thought about whether NAs are random; for now, assume yes.
- We added a Month_name factor with proper ordering, and a more reader-friendly Temp_C column.
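One informal way to probe the "NAs are random" assumption is to count missing Ozone readings per month. If one month accounts for most of the gaps, its summaries rest on fewer days and deserve extra caution. A base-R sketch, where missing_by_month is an illustrative name:

```r
# Number of missing Ozone readings in each month (5 = May, ..., 9 = September)
missing_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, sum)
missing_by_month

# The same information as a share of each month's days
round(tapply(is.na(airquality$Ozone), airquality$Month, mean), 2)
```

This is a sanity check, not a formal test of randomness.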
Step 3 — First look: summary statistics
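A per-month summary could be computed as sketched below. The block rebuilds aq so it runs on its own; the column names follow the condensed script later on this page.

```r
library(dplyr)

# Rebuild the tidy data from Step 2 so this block is self-contained
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep")),
    Temp_C = (Temp - 32) * 5 / 9
  )

# One row per month: day count, median and mean Ozone, mean temperature
monthly <- aq |>
  group_by(Month_name) |>
  summarise(
    n        = n(),
    median_O = median(Ozone),
    mean_O   = mean(Ozone),
    mean_T   = mean(Temp_C),
    .groups  = "drop"
  )

monthly
```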
Already the pattern emerges: July and August have the highest mean Ozone and the highest mean temperature. May and September are cooler and cleaner.
Step 4 — Visualize the relationship
Plots will tell us more than summary tables ever can.
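A sketch of the scatterplot, based on the plot in the condensed script later on this page. The axis labels and the object name p_scatter are our additions.

```r
library(dplyr)
library(ggplot2)

# Rebuild the tidy data from Step 2 so this block is self-contained
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep")),
    Temp_C = (Temp - 32) * 5 / 9
  )

# Points colored by month, plus a single black loess curve over all days
p_scatter <- ggplot(aq, aes(Temp_C, Ozone, color = Month_name)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(
    title = "Ozone vs. temperature, NYC 1973",
    x     = "Temperature (°C)",
    y     = "Ozone (ppb)",
    color = "Month"
  )

p_scatter
```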
The picture is unambiguous: warmer days have higher Ozone. The relationship looks roughly monotonic, with visible curvature at high temperatures.
Step 5 — Look at monthly variation explicitly
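A boxplot per month is one natural way to compare medians and spreads. The plot type is our assumption, as are the name p_box and the labels.

```r
library(dplyr)
library(ggplot2)

# Rebuild the tidy data from Step 2 so this block is self-contained
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep"))
  )

# One box per month: the center line is the median, box height is the IQR
p_box <- ggplot(aq, aes(Month_name, Ozone)) +
  geom_boxplot() +
  labs(
    title = "Daily ozone by month, NYC 1973",
    x     = "Month",
    y     = "Ozone (ppb)"
  )

p_box
```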
July and August have clearly higher medians and wider spreads — both more Ozone and more variable Ozone. The shoulder months (May, September) are tighter and lower.
Step 6 — A quick formal check
We see a relationship between Temperature and Ozone, but how strong is it? And is it surprising given the noise?
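A sketch of the check, using the same cor() and lm() calls as the condensed script later on this page. The names ozone_temp_cor and fit are illustrative.

```r
library(dplyr)

# Rebuild the tidy data from Step 2 so this block is self-contained
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(Temp_C = (Temp - 32) * 5 / 9)

# Strength of the linear association
ozone_temp_cor <- cor(aq$Temp_C, aq$Ozone)
ozone_temp_cor

# Simple linear regression of Ozone on temperature
fit <- lm(Ozone ~ Temp_C, data = aq)
summary(fit)
```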
Reading the output:
- The correlation is around 0.7 — a strong positive linear relationship.
- The regression coefficient on Temp_C is positive and the p-value is very small, well below any conventional cutoff.
- The R² (multiple R-squared in the output) tells you what fraction of Ozone variation is explained by temperature alone.
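For a regression with a single predictor, R² equals the squared correlation, so the two numbers can be cross-checked against each other. A base-R sketch, where r and r_squared are illustrative names:

```r
# Base R only: drop missing Ozone and convert temperature to Celsius
aq <- subset(airquality, !is.na(Ozone))
aq$Temp_C <- (aq$Temp - 32) * 5 / 9

fit       <- lm(Ozone ~ Temp_C, data = aq)
r         <- cor(aq$Temp_C, aq$Ozone)
r_squared <- summary(fit)$r.squared

# The last two values should agree up to floating-point error
c(correlation = r, correlation_squared = r^2, r_squared = r_squared)
```

A correlation near 0.7 squares to roughly 0.5, which is why temperature alone accounts for about half of the variation in Ozone.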
We're not making causal claims (Ozone formation has many drivers!) — only describing a strong association in this dataset.
Step 7 — Interpret
Pulling it together in a short, plain-English paragraph (this is literally the kind of write-up an analyst delivers):
Across the 1973 New York summer, daily ozone levels rose sharply with daily temperature. The relationship is strong (correlation ≈ 0.7) and is very unlikely to be a chance pattern given the sample size. Ozone was both higher and more variable in July and August than in May or September, consistent with the known link between heat, sunlight, and photochemical ozone production. About half the day-to-day variation in ozone is explained by temperature alone; the rest is presumably driven by other factors, some recorded in this dataset but not modeled here (wind, solar radiation) and some not recorded at all (precursor emissions, atmospheric mixing).
That paragraph is the deliverable. Everything before it — loading, cleaning, plotting, modeling — was scaffolding to make it trustworthy.
Putting it all in one script
Here's the entire analysis condensed:
# air-quality-analysis.R
library(dplyr)
library(ggplot2)

# 1. Load + tidy
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep")),
    Temp_C = (Temp - 32) * 5/9
  )

# 2. Summarize
monthly <- aq |>
  group_by(Month_name) |>
  summarise(
    n = n(),
    median_O = median(Ozone),
    mean_O = mean(Ozone),
    mean_T = mean(Temp_C),
    .groups = "drop"
  )
print(monthly)

# 3. Model
cat("Correlation:", cor(aq$Temp_C, aq$Ozone), "\n")
print(summary(lm(Ozone ~ Temp_C, data = aq)))

# 4. Plot
ggplot(aq, aes(Temp_C, Ozone, color = Month_name)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(title = "Ozone vs. temperature, NYC 1973")

That's ~25 lines for an entire analysis. That is the leverage R gives you.
The full challenge
Now you do one yourself. Use the built-in iris dataset. For each species, compute the median and mean petal length and store the result in a tibble (or data frame) named iris_summary with three columns: Species, median_petal_length, mean_petal_length.
Using dplyr, build iris_summary from iris: group by Species and summarise to produce columns median_petal_length and mean_petal_length. There should be 3 rows (one per species).
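If you get stuck, here is one possible solution; try the exercise yourself first. It assumes the petal length column is Petal.Length, as in the built-in iris data.

```r
library(dplyr)

# One row per species with the median and mean petal length
iris_summary <- iris |>
  group_by(Species) |>
  summarise(
    median_petal_length = median(Petal.Length),
    mean_petal_length   = mean(Petal.Length),
    .groups = "drop"
  )

iris_summary
```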
You just did a complete analysis: question → data → cleaning → summary → visualization → interpretation. That's the loop. Every real analysis is a longer, deeper version of the same loop.
Test your understanding
What's the very first thing you should do after loading any new dataset?
- Run a regression.
- Make a plot.
- Inspect it with dim(), head(), str(), and summary(), so you understand what's actually there before assuming anything.
- Compute the mean of every column.

In the analysis above, we showed that Ozone and temperature are strongly correlated. That means:
- Temperature causes ozone.
- The two vary together strongly in this dataset. That is a real association, but correlation alone doesn't prove causation; other plausible factors (sunlight, emissions, weather patterns) are not ruled out.
- The model is broken.
- Ozone causes temperature.

What's the actual deliverable of an analysis like this one?
- The raw data.
- The R code.
- A clearly written, honest interpretation of what was found, backed up by reproducible code and clear visualizations.
- A p-value.
You've reached the last conceptual page of the course. The next page wraps everything up and points you to what's next.