Writing Your Own Functions
Functions are how analysis code stays understandable as it grows. Learn to write small, well-named functions that capture intent instead of copy-pasting logic.
So far, almost every code block we've written has been a linear script: do a thing, then do another thing. That works for small, one-off analyses. But the moment you find yourself writing the same five lines for the fifth dataset, you've hit the wall that functions were invented to break through.
A function is just a named, reusable recipe. You package up some logic, give it a name, declare what it needs, and from then on you call it by name. Your analysis script reads like a story ("clean, transform, summarize, plot") instead of like the inside of a kitchen ("chop, chop, chop, stir, stir, chop, stir, …").
Anatomy of a function
A function in R has four parts:
- A name (the variable you assign it to).
- A list of parameters (the names the function uses internally for its inputs).
- A body (the R code that runs).
- A return value (by default, the value of the last expression, or whatever you wrap in return()).
The simplest possible function:
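A minimal sketch, assuming the name square() that the note below refers to; the parameter name x is an illustrative choice:

```r
# Name: square. Parameter: x. Body: x * x.
# Return value: the last expression evaluated, so no return() is needed.
square <- function(x) {
  x * x
}

square(4)           # a single number
square(c(1, 2, 3))  # a whole vector, element by element
```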
Note: because * is vectorized, our square() function also
works on whole vectors. We didn't have to write a loop.
Defaults and named arguments
You can give parameters default values, and callers can specify arguments by name in any order:
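As an illustration, here is a small sketch with a hypothetical helper, format_percent(); the function name, its parameters, and its defaults are invented for this example:

```r
# `digits` and `suffix` have defaults, so callers may leave them out
format_percent <- function(x, digits = 1, suffix = "%") {
  paste0(round(x * 100, digits), suffix)
}

format_percent(0.1234)                       # relies on both defaults
format_percent(0.1234, digits = 2)           # overrides one default by name
format_percent(suffix = " pct", x = 0.1234)  # named arguments, any order
```

Naming the arguments at the call site also documents the call: a reader does not need to remember which position means what.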
Defaults are wonderful: they make the common case easy while keeping the uncommon case possible.
A useful, real-world function
Let's write something concrete: a function that takes a numeric vector and returns a small named list summarizing it.
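One possible shape for such a function, as a sketch; the name describe_vector() and the particular statistics chosen are assumptions made for illustration:

```r
# Hypothetical helper: summarize a numeric vector as a named list.
# NA handling lives in one argument, so it is easy to change later.
describe_vector <- function(x, na.rm = TRUE) {
  list(
    n      = length(x),
    n_na   = sum(is.na(x)),
    mean   = mean(x, na.rm = na.rm),
    median = median(x, na.rm = na.rm),
    sd     = sd(x, na.rm = na.rm),
    min    = min(x, na.rm = na.rm),
    max    = max(x, na.rm = na.rm)
  )
}

stats <- describe_vector(mtcars$mpg)
stats$mean   # pull out a single element by name
```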
That single function replaces ~7 lines you'd otherwise rewrite
every time you wanted a summary. The benefits compound: if you
later want to add quantile() or change the NA handling, you
edit one place.
The DRY principle: Don't Repeat Yourself
Look at this code. What's the pattern?
mean_mpg <- mean(mtcars$mpg)
mean_hp <- mean(mtcars$hp)
mean_wt <- mean(mtcars$wt)
mean_qsec <- mean(mtcars$qsec)
Four nearly identical lines. The moment you have to compute one more mean, you copy-paste again. That's a smell. A function removes the repetition:
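A sketch of one way to do that; column_mean() is a hypothetical helper, and applying it with sapply() is one option among several:

```r
# Hypothetical helper: mean of one named column of a data frame
column_mean <- function(data, column) {
  mean(data[[column]], na.rm = TRUE)
}

# Adding another column now means adding one name, not one more line of logic
columns <- c("mpg", "hp", "wt", "qsec")
means <- sapply(columns, function(col) column_mean(mtcars, col))
means
```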
Rule of thumb: if you see yourself typing the same shape of code 3+ times with small variations, lift it into a function (or find an existing one — chances are someone already wrote it).
Functions for transformations
Functions shine in data transformation. Here's one that standardizes a numeric vector (subtract mean, divide by SD — useful for comparing variables on different scales):
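A minimal sketch of that function; the name standardize() and the na.rm argument are assumptions, not something the text prescribes:

```r
# Standardize a numeric vector: subtract the mean, divide by the SD
standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

z <- standardize(c(2, 4, 6))
z
```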
You can now apply it to many columns:
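For instance, the following sketch applies a one-line standardize() (defined inline so the snippet runs on its own) to three mtcars columns; the choice of columns is arbitrary:

```r
# One-line standardizer: subtract the mean, divide by the SD
standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# lapply() runs the function once per selected column
columns <- c("mpg", "hp", "wt")
scaled <- as.data.frame(lapply(mtcars[columns], standardize))

colMeans(scaled)    # each mean is zero up to floating-point error
sapply(scaled, sd)  # each standard deviation is one
```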
Without the function, you'd repeat the standardization expression once per column.
Function design tips
- Name verbs. Functions do things: compute_*, clean_*, summarize_*, plot_*. Avoid f1, helper, do_stuff.
- One job per function. A function called load_and_clean() is a code smell; split it into load_data() and clean_data().
- Keep it small. If a function won't fit on a screen, it's probably trying to do too much.
- Predictable inputs and outputs. Document what the function expects and returns (even if just a comment).
- Pure functions are easier to reason about. A function that only depends on its inputs (no global state) is much easier to test and reuse.
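The last two tips can be sketched side by side. Both function names and the cutoff variable are hypothetical, chosen only to contrast a function that reads global state with one that takes everything as an argument:

```r
cutoff <- 20  # global state that the first function silently depends on

# Harder to reason about: the result changes whenever `cutoff` changes
count_above_global <- function(x) {
  sum(x > cutoff)
}

# Easier to test and reuse: everything it needs arrives as an argument.
# Expects: numeric vector `x`, single number `cutoff`.
# Returns: the count of values in `x` greater than `cutoff` (NAs ignored).
count_above <- function(x, cutoff) {
  sum(x > cutoff, na.rm = TRUE)
}

count_above(mtcars$mpg, cutoff = 20)
```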
Pure vs. side-effecting
A "pure" function returns a value and changes nothing else. A "side-effecting" function does things to the outside world (prints, plots, writes a file, modifies a global). Both are useful — but mixing them inside one function makes code hard to reason about:
# Mixed concerns: hard to reuse
analyze_and_print <- function(x) {
  result <- mean(x) / sd(x)
  cat("Result is:", result, "\n")  # side effect
  result
}
Better to separate:
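One way to split the work, sketched with the hypothetical names compute_ratio() and print_ratio():

```r
# Pure: computes the ratio and returns it, nothing else
compute_ratio <- function(x) {
  mean(x) / sd(x)
}

# Side-effecting: prints the result, then hands it back invisibly
print_ratio <- function(result) {
  cat("Result is:", result, "\n")
  invisible(result)
}

ratio <- compute_ratio(mtcars$mpg)  # no printing, easy to reuse
print_ratio(ratio)                  # printing happens only when asked for
```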
invisible() returns a value without auto-printing — useful
when the caller might or might not want it.
Scope, briefly
Variables created inside a function live only inside it. They don't leak out to your global environment:
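A small sketch of that behavior; add_tax() and tax_rate are made-up names, and the final check assumes nothing called tax_rate exists elsewhere in your session:

```r
add_tax <- function(price) {
  tax_rate <- 0.2          # exists only while add_tax() is running
  price * (1 + tax_rate)
}

add_tax(100)
exists("tax_rate")         # FALSE: the local variable did not leak out
```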
This isolation is a feature: it prevents functions from silently breaking each other. Variables you want to share across the analysis live in the global environment; everything else stays local.
Test your understanding
A function in R returns:
- Nothing; only return() returns.
- The value of its last expression by default, or an explicit return() value if you use one.
- The first expression.
- A list of all its expressions.
Why is the DRY ("Don't Repeat Yourself") principle especially valuable in analysis code?
- It saves typing.
- Every duplication is a place a bug can hide and a place you'll have to remember to update; lifting repeated logic into a function gives you one place to fix it and one place to test.
- Functions run faster than inline code.
- It makes scripts shorter.
Which is the best name for a function that computes a standardized z-score from a numeric vector?
- f
- do_thing
- standardize
- x_value
Mini challenge: write a summarize_group function
Write a function called summarize_group(x) that takes a
numeric vector and returns a named list with elements
n, mean, and sd. Use na.rm = TRUE for both the mean and
the SD.
Define summarize_group <- function(x) { ... } so that calling it on a numeric vector returns a list with elements n (length), mean (mean, ignoring NAs), and sd (sd, ignoring NAs).
Functions are the building blocks of every reusable analysis. On the next page, we'll zoom out one more level to scripts and projects: how to organize a real analysis on disk so that future-you (and your collaborators) can actually run it.