Mini Project Walkthrough

A complete end-to-end analysis of a real built-in dataset — load, inspect, tidy, transform, summarize, visualize, interpret. This is the workflow every analysis follows.

You've now met every piece of the puzzle: vectors, data frames, dplyr verbs, ggplot2, summary statistics, and inference. This page strings them all together on a real dataset so you can see the whole arc of an analysis at once.

We'll use the built-in airquality dataset — daily air quality measurements from New York during May–September 1973. Our question:

Is ozone higher on hotter days, and how does it vary across the summer months?

That single question will pull us through the entire workflow.

Step 1 — Load and inspect

Look before you leap. Always.
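
A minimal sketch of that first look, using only base R functions. The name na_counts is just an illustrative variable, not something the rest of the page depends on.

```r
# airquality ships with base R (datasets package), so nothing needs installing.
data(airquality)

dim(airquality)      # number of rows and columns
head(airquality)     # first six rows
str(airquality)      # column names and types at a glance
summary(airquality)  # ranges, quartiles, and NA counts per column

# Count missing values per column explicitly
na_counts <- colSums(is.na(airquality))
na_counts
```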

What we learn:

153 rows, 6 columns: Ozone, Solar.R, Wind, Temp, Month, Day.
Ozone and Solar.R have NAs (37 and 7 respectively): missing values we'll need to handle.
Month is stored as an integer (5 through 9) but is conceptually categorical.
Temp is in degrees Fahrenheit.

Step 2 — Tidy: handle NAs and fix types
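
A minimal sketch of the tidying step, mirroring the condensed script later on this page. It assumes dplyr is installed and R is version 4.1 or newer (for the native |> pipe).

```r
library(dplyr)

aq <- airquality |>
  filter(!is.na(Ozone)) |>   # keep only days with an Ozone reading
  mutate(
    # Ordered month labels so tables and plots follow the calendar
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep")),
    # Fahrenheit to Celsius
    Temp_C     = (Temp - 32) * 5 / 9
  )

nrow(aq)               # rows remaining after the filter
levels(aq$Month_name)  # months in calendar order
```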

A few notes:

We dropped rows missing Ozone because Ozone is the variable we want to analyze. (Solar.R NAs we'll handle later if needed.) Dropping rows is reasonable only when the values are missing more or less at random, so that the remaining days are still representative; for now, assume they are.
We added a Month_name factor with proper ordering, and a more reader-friendly Temp_C column.
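
The "missing at random" assumption in the notes above can be given a rough sanity check. The sketch below is one possible approach, not a formal test: it compares days with and without an Ozone reading, and tabulates missingness by month. The names missing_check and missing_by_month are illustrative.

```r
library(dplyr)

# Do days with a missing Ozone value look different on the other variables?
missing_check <- airquality |>
  mutate(ozone_missing = is.na(Ozone)) |>
  group_by(ozone_missing) |>
  summarise(
    n         = n(),
    mean_temp = mean(Temp),
    mean_wind = mean(Wind),
    .groups   = "drop"
  )
missing_check

# Is the missingness spread evenly across months, or concentrated in some?
missing_by_month <- table(
  Month         = airquality$Month,
  ozone_missing = is.na(airquality$Ozone)
)
missing_by_month
```

If the missing days cluster in one month or differ clearly in temperature or wind, the dropped rows are not a random sample, and monthly comparisons deserve extra caution.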

Step 3 — First look: summary statistics
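
A sketch of the per-month summary, using the same column names as the condensed script later on this page. The tidy data frame aq is rebuilt first so the snippet runs on its own.

```r
library(dplyr)

# Rebuild the tidy data so this snippet is self-contained
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep")),
    Temp_C     = (Temp - 32) * 5 / 9
  )

# One row per month: sample size, typical Ozone, typical temperature
monthly <- aq |>
  group_by(Month_name) |>
  summarise(
    n        = n(),
    median_O = median(Ozone),
    mean_O   = mean(Ozone),
    mean_T   = mean(Temp_C),
    .groups  = "drop"
  )
monthly
```

The n column is worth a glance before comparing months, since a month with few Ozone readings gives a noisier average.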

Already the pattern emerges: July and August have the highest mean Ozone and the highest mean temperature. May and September are cooler and cleaner.

Step 4 — Visualize the relationship

A plot often reveals structure that a summary table hides.
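
A sketch of the scatter plot, following the plotting code in the condensed script later on this page. The axis labels and the object name p_scatter are additions for illustration; the dataset's documentation gives Ozone in parts per billion.

```r
library(dplyr)
library(ggplot2)

# Rebuild the tidy data so this snippet is self-contained
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep")),
    Temp_C     = (Temp - 32) * 5 / 9
  )

# Points colored by month, plus one overall loess trend line in black
p_scatter <- ggplot(aq, aes(Temp_C, Ozone, color = Month_name)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(
    title = "Ozone vs. temperature, NYC 1973",
    x     = "Temperature (Celsius)",
    y     = "Ozone (ppb)",
    color = "Month"
  )
p_scatter
```

Setting color = "black" inside geom_smooth() overrides the month coloring for that layer, so a single trend line is drawn rather than one per month.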

The picture is clear: warmer days tend to have higher Ozone. The relationship is increasing throughout but not a straight line, with Ozone rising more steeply at high temperatures.

Step 5 — Look at monthly variation explicitly
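
One way to show the monthly variation, assuming a boxplot per month is the intended chart (it displays medians and spreads directly). The object name p_box and the labels are illustrative.

```r
library(dplyr)
library(ggplot2)

# Rebuild the tidy data so this snippet is self-contained
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May", "Jun", "Jul", "Aug", "Sep"))
  )

# One box per month: the middle line is the median, the box is the interquartile range
p_box <- ggplot(aq, aes(Month_name, Ozone)) +
  geom_boxplot() +
  labs(
    title = "Ozone by month, NYC 1973",
    x     = "Month",
    y     = "Ozone (ppb)"
  )
p_box
```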

July and August have clearly higher medians and wider spreads — both more Ozone and more variable Ozone. The shoulder months (May, September) are tighter and lower.

Step 6 — A quick formal check

We see a relationship between temperature and Ozone, but how strong is it? And could a pattern this strong plausibly arise from noise alone?
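
A sketch of the check, using the same correlation and linear model as the condensed script later on this page. The names r and fit are illustrative.

```r
library(dplyr)

# Rebuild the tidy data so this snippet is self-contained
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(Temp_C = (Temp - 32) * 5 / 9)

# Strength of the linear association
r <- cor(aq$Temp_C, aq$Ozone)
r

# Simple linear regression of Ozone on temperature
fit <- lm(Ozone ~ Temp_C, data = aq)
summary(fit)  # coefficient, p-value, and Multiple R-squared
```

With a single predictor, the Multiple R-squared reported by summary() equals the squared correlation, so the two numbers tell the same story.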

Reading the output:

The correlation is around 0.7 — a strong positive linear relationship.
The regression coefficient on Temp_C is positive and the p-value is very small, well below any conventional cutoff.
The R² (Multiple R-squared in the output) tells you what fraction of the variation in Ozone is explained by temperature alone; here it is just under one half.

We're not making causal claims (Ozone formation has many drivers!) — only describing a strong association in this dataset.

Step 7 — Interpret

Pulling it together in a short, plain-English paragraph (this is literally the kind of write-up an analyst delivers):

Across the 1973 New York summer, daily ozone levels rose sharply with daily temperature. The relationship is strong (correlation ≈ 0.7) and is very unlikely to be a chance pattern given the sample size. Ozone was both higher and more variable in July and August than in May or September, consistent with the known link between heat, sunlight, and photochemical ozone production. About half the day-to-day variation in ozone is explained by temperature alone; the rest is presumably driven by other factors, some recorded in this dataset but not modeled here (wind, solar radiation) and some not recorded at all (precursor emissions, atmospheric mixing).

That paragraph is the deliverable. Everything before it — loading, cleaning, plotting, modeling — was scaffolding to make it trustworthy.

Putting it all in one script

Here's the entire analysis condensed:

# air-quality-analysis.R
library(dplyr)
library(ggplot2)

# 1. Load + tidy
aq <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(
    Month_name = factor(Month, levels = 5:9,
                        labels = c("May","Jun","Jul","Aug","Sep")),
    Temp_C     = (Temp - 32) * 5/9
  )

# 2. Summarize
monthly <- aq |>
  group_by(Month_name) |>
  summarise(
    n        = n(),
    median_O = median(Ozone),
    mean_O   = mean(Ozone),
    mean_T   = mean(Temp_C),
    .groups  = "drop"
  )
print(monthly)

# 3. Model
cat("Correlation:", cor(aq$Temp_C, aq$Ozone), "\n")
print(summary(lm(Ozone ~ Temp_C, data = aq)))

# 4. Plot (wrapped in print() so the figure is also drawn when the script is run via source())
p <- ggplot(aq, aes(Temp_C, Ozone, color = Month_name)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "black") +
  labs(title = "Ozone vs. temperature, NYC 1973")
print(p)

That's ~25 lines for an entire analysis. That is the leverage R gives you.

The full challenge

Now you do one yourself. Use the built-in iris dataset and dplyr: group iris by Species and summarise to compute the median and mean petal length for each species. Store the result in a tibble (or data frame) named iris_summary with three columns (Species, median_petal_length, mean_petal_length) and 3 rows, one per species.
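
Try it yourself before reading on. For checking your work afterwards, here is one possible solution; other approaches that produce the same three columns are equally valid.

```r
library(dplyr)

iris_summary <- iris |>
  group_by(Species) |>
  summarise(
    median_petal_length = median(Petal.Length),
    mean_petal_length   = mean(Petal.Length),
    .groups = "drop"
  )
iris_summary
```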

You just did a complete analysis: question → data → cleaning → summary → visualization → interpretation. That's the loop. Every real analysis is a longer, deeper version of the same loop.

Test your understanding

Question (select one)

What's the very first thing you should do after loading any new dataset?

Run a regression.

Make a plot.

Inspect it — dim(), head(), str(), summary() — so you understand what's actually there before assuming anything.

Compute the mean of every column.

Question (select one)

In the analysis above, we showed that Ozone and Temp are strongly correlated. That means:

Temperature causes ozone.

The two vary together strongly in this dataset — a real association, but correlation alone doesn't prove causation; other plausible factors (sunlight, emissions, weather patterns) are not ruled out.

The model is broken.

Ozone causes temperature.

Question (select one)

What's the actual deliverable of an analysis like this one?

The raw data.

The R code.

A clearly-written, honest interpretation of what was found, backed up by reproducible code and clear visualizations.

A p-value.

You've reached the last conceptual page of the course. The next page wraps everything up and points you to what's next.

