Scripts and Projects
A single .R file is a script. A folder full of related scripts, data, and outputs is a project. Treating analysis as a project — not a notebook of one-off commands — is what makes it reproducible.
In an interactive console, you can do anything. Type a command, inspect a result, type another. That's wonderful for exploration — and terrible for reproducing an analysis a week, a month, or a year later.
The remedy is to treat your work as a project: a folder on disk with a predictable structure, R scripts that run top-to-bottom, data kept in known locations, and outputs written in ways anyone (including future-you) can re-run.
A minimal project layout
A common starter layout:
my-analysis/
├── README.md
├── data/
│   ├── raw/
│   │   └── sales.csv
│   └── processed/
│       └── sales_clean.csv
├── R/
│   ├── 01-load.R
│   ├── 02-clean.R
│   ├── 03-analyze.R
│   └── 04-plot.R
└── output/
    ├── summary.csv
    └── trend.png
A few conventions worth adopting from day one:
- Never edit data/raw/. Treat it as read-only. All transformations produce new files in data/processed/.
- Number your scripts. 01-, 02-, 03- makes the order unambiguous and tab-completion easy.
- Write a README.md. Even 5 sentences explaining what the project does and how to run it pays for itself the first time you return after a break.
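One way to keep this layout consistent is to script its creation, so every new analysis starts the same way. The sketch below uses only base R. The function name create_project_skeleton is hypothetical, and the folder names simply mirror the tree above.

```r
# Hypothetical helper: builds the folder layout shown above under `root`.
create_project_skeleton <- function(root) {
  dirs <- file.path(root, c("data/raw", "data/processed", "R", "output"))
  for (d in dirs) {
    # recursive = TRUE also creates parent folders such as data/
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  readme <- file.path(root, "README.md")
  if (!file.exists(readme)) {
    # Placeholder text; replace with a real description of the project.
    writeLines(
      c("# my-analysis", "", "What this project does and how to run it."),
      readme
    )
  }
  invisible(root)
}

# Demo in a throwaway location so nothing in a real project is touched.
skeleton_root <- file.path(tempdir(), "my-analysis")
create_project_skeleton(skeleton_root)
print(list.files(skeleton_root, recursive = TRUE, include.dirs = TRUE))
```

The helper never overwrites an existing README.md, so it is safe to re-run on a project that already exists.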
Scripts vs. interactive sessions
A script is just a .R file you can run top-to-bottom. The
discipline is: anything important should live in a script. If
you fix a bug in the console, copy that fix back into the
script. The script is the source of truth; the console is a
scratchpad.
A typical analysis script:
# R/02-clean.R
# Clean the raw sales CSV.
library(readr)
library(dplyr)
raw <- read_csv("data/raw/sales.csv")
clean <- raw |>
  filter(!is.na(amount)) |>
  mutate(
    date = as.Date(date),
    amount = as.numeric(amount),
    category = tolower(trimws(category))
  )
write_csv(clean, "data/processed/sales_clean.csv")
Notice:
- It declares its dependencies at the top with library() calls.
- It reads from data/raw/ and writes to data/processed/.
- It would run identically on any machine that has the same packages and the same input file.
source() for orchestration
You can chain scripts with source(), which runs another R file
in your current session:
# run-all.R
source("R/01-load.R")
source("R/02-clean.R")
source("R/03-analyze.R")
source("R/04-plot.R")
cat("Done!\n")
Now you have a single command, Rscript run-all.R, that
reproduces the entire analysis from raw data to final outputs.
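If the scripts carry numeric prefixes, the entry point can also discover them instead of listing each one. This is a sketch rather than a replacement for the explicit version: run_numbered_scripts is a hypothetical helper, and it assumes every script to run has a two-digit prefix.

```r
# Hypothetical helper: source every "NN-name.R" script in `dir`, in order.
run_numbered_scripts <- function(dir = "R", envir = globalenv()) {
  scripts <- list.files(dir, pattern = "^[0-9]{2}-.*\\.R$", full.names = TRUE)
  scripts <- sort(scripts)  # zero-padded prefixes sort in run order
  for (script in scripts) {
    message("Running ", script)
    # source() stops at the first error, so later steps never run on bad input
    source(script, local = envir)
  }
  invisible(scripts)
}

# Demo with two throwaway scripts written to a temporary folder.
run_demo_dir <- file.path(tempdir(), "run-all-demo")
dir.create(run_demo_dir, showWarnings = FALSE)
writeLines("step_log <- 'load'", file.path(run_demo_dir, "01-load.R"))
writeLines("step_log <- c(step_log, 'clean')", file.path(run_demo_dir, "02-clean.R"))

run_demo_env <- new.env()
ran <- run_numbered_scripts(run_demo_dir, envir = run_demo_env)
print(basename(ran))
print(run_demo_env$step_log)
```

The explicit list is easier to read and lets you comment out a step. The discovery version avoids forgetting to register a new script.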
Working directory & paths
A common source of "it worked on my computer" pain is hard-coded
paths like "C:/Users/Ada/Documents/data/sales.csv". Don't do
that. Use relative paths anchored to the project root, like
"data/raw/sales.csv". Tools like RStudio Projects or the
here package make this even safer, because they pin down where the project root is.
Relative paths + a consistent project root = portable code.
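The idea behind these tools can be sketched in a few lines of base R: walk up from the current folder until a marker file appears, then build every path from that root. The function names and the choice of README.md as the marker are assumptions for illustration. The here package applies its own rules for locating the root, and with it installed the equivalent call is here::here("data", "raw", "sales.csv").

```r
# Hypothetical helper: walk upward until a folder containing `marker` is found.
find_project_root <- function(start = getwd(), marker = "README.md") {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (file.exists(file.path(current, marker))) return(current)
    parent <- dirname(current)
    if (identical(parent, current)) {
      # Reached the filesystem root without finding the marker.
      stop("No folder containing ", marker, " found above ", start)
    }
    current <- parent
  }
}

# Hypothetical helper: build a path relative to the project root.
project_path <- function(..., start = getwd()) {
  file.path(find_project_root(start), ...)
}

# Demo: a throwaway project, queried from a nested subfolder.
path_demo_root <- file.path(tempdir(), "path-demo")
dir.create(file.path(path_demo_root, "R", "helpers"),
           recursive = TRUE, showWarnings = FALSE)
writeLines("demo project", file.path(path_demo_root, "README.md"))

csv_path <- project_path("data", "raw", "sales.csv",
                         start = file.path(path_demo_root, "R", "helpers"))
print(csv_path)
```

Because the root is found by searching upward, the same call returns the same path whether the script runs from the project root or from a subfolder.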
Bringing scripts into a notebook with R Markdown / Quarto
R Markdown (.Rmd) and its successor Quarto (.qmd) let
you mix prose, code, and outputs in one document — perfect for
analyses you want to share as readable reports.
A tiny Quarto-style example (don't run this — it's just the flavor):
---
title: "Sales analysis"
author: "Ada Lovelace"
format: html
---
## Setup
```{r}
library(dplyr)
sales <- read.csv("data/processed/sales_clean.csv")
```
## Monthly totals
```{r}
sales |>
  group_by(month = format(as.Date(date), "%Y-%m")) |>
  summarise(total = sum(amount))
```
When you render the document, it runs all code blocks and weaves the outputs into the final HTML/PDF. You get a single self-describing report that anyone can re-run.
Version control (just a taste)
You don't need to master git to benefit from it. A few habits:
- Initialize a git repo in your project folder: git init.
- Commit often, with messages that explain why.
- Don't commit large raw data. Add it to .gitignore.
- Push to a remote (GitHub, GitLab, etc.) for backup and collaboration.
Git turns "I overwrote yesterday's file" from a tragedy into "oh, let me check the history."
Package management
A frequent source of broken analyses: "I updated dplyr and now my code doesn't work." The standard fix is to pin package versions per project. Two common tools:
- renv: creates a project-local library and a lockfile (renv.lock) with exact package versions.
- packrat: older predecessor of renv.
A typical workflow:
# install.packages("renv")
renv::init()      # set up the project library
# ... install packages, do work ...
renv::snapshot()  # record current versions to renv.lock
Now anyone (including future-you on a new laptop) can run
renv::restore() to install the exact versions used.
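When a lockfile is more than a small project needs, a lighter fallback is to record the versions yourself and paste them into the README. This does not pin anything; it only documents what was used. The sketch below uses base R only, and the package names in it are placeholders for whatever your scripts load.

```r
# Placeholder list: replace with the packages your scripts load via library().
pkgs <- c("stats", "utils")

# packageVersion() returns the installed version of each package.
versions <- vapply(
  pkgs,
  function(p) as.character(utils::packageVersion(p)),
  character(1)
)

# One line for R itself, then one line per package.
version_report <- c(
  paste("R", paste(R.version$major, R.version$minor, sep = ".")),
  paste(names(versions), versions)
)
writeLines(version_report)
```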
Checklist for a reproducible project
When you finish a project, run through this list:
- ☐ Does the README.md explain the project and how to run it?
- ☐ Is raw data preserved unchanged in data/raw/?
- ☐ Do scripts run top-to-bottom without manual steps?
- ☐ Are there no hard-coded absolute paths?
- ☐ Are package versions pinned (renv, DESCRIPTION) or at least listed in the README?
- ☐ Are outputs (plots, tables) regenerated from scripts, not hand-edited?
- ☐ Is the project under version control?
A project that satisfies all 7 is one a stranger could pick up and rerun. Building projects this way is the single highest-leverage skill an analyst can develop.
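Some checklist items can be checked mechanically. As a rough sketch, the following scans scripts for quoted strings that look like absolute paths. The function name and the regular expression are assumptions for illustration, and the pattern is deliberately simple: it flags strings that start with a drive letter, "/" or "~/", so it will miss some cases and flag some harmless ones.

```r
# Hypothetical helper: list lines in .R files that contain a quoted string
# starting with a drive letter (C:/ or C:\), a leading "/", or "~/".
find_absolute_paths <- function(dir = "R") {
  files <- list.files(dir, pattern = "\\.R$", full.names = TRUE)
  empty <- data.frame(file = character(), line = integer(), text = character(),
                      stringsAsFactors = FALSE)
  if (length(files) == 0) return(empty)
  pattern <- "[\"'](([A-Za-z]:[/\\\\])|/|~/)"
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines)
    data.frame(file = rep(basename(f), length(idx)), line = idx,
               text = lines[idx], stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Demo: two throwaway scripts, one with an absolute path and one without.
check_dir <- file.path(tempdir(), "path-check-demo")
dir.create(check_dir, showWarnings = FALSE)
writeLines('raw <- read.csv("C:/Users/Ada/Documents/data/sales.csv")',
           file.path(check_dir, "01-load.R"))
writeLines('raw <- read.csv("data/raw/sales.csv")',
           file.path(check_dir, "02-clean.R"))

found <- find_absolute_paths(check_dir)
print(found)
```

An empty result is not proof that the project is portable, only that none of the common patterns appeared.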
Test your understanding
Why should you never edit files in data/raw/?
- They're owned by the OS.
- Raw data is the immutable ground truth: preserving it means anyone can re-derive every cleaned output by re-running scripts, and accidental edits can't silently corrupt your analysis.
- R can't write to that folder.
- It's a tradition with no real reason.
What's the main problem with a hard-coded path like "C:/Users/Ada/Documents/data/sales.csv" in a script?
- It's longer to type.
- It won't work on anyone else's machine (or even yours, after you reorganize your files); relative paths anchored to the project root are portable.
- R can't read Windows paths.
- It only works for CSV files.
What's the purpose of a tool like renv or packrat?
- They speed up script execution.
- They record the exact package versions a project uses, so the analysis can be reliably re-run later even after packages on CRAN have changed.
- They replace base R.
- They make scripts run in parallel.
Mini challenge: design a project layout
Imagine you're starting an analysis that pulls daily stock-price data from a CSV, computes monthly returns, and produces a chart. Without writing any code, sketch (in plain text or comments) the directory structure you'd create — folder names, file names, and what each script does.
There's no automated test for this one — but a good sketch would include:
- data/raw/ and data/processed/ folders
- An R/ folder with numbered scripts (e.g., 01-load.R, 02-clean.R, 03-monthly-returns.R, 04-plot.R)
- An output/ folder for the chart
- A README.md describing the project
- An optional renv.lock for reproducibility
Compare your sketch to the template above. Did you split "load" from "clean"? Did you keep raw and processed data separate? Did you give yourself a single entry point to re-run everything?
The two remaining pages walk through a complete end-to-end mini analysis and then point you at what to learn next.