Writing Your Own Functions

Functions are how analysis code stays understandable as it grows. Learn to write small, well-named functions that capture intent instead of copy-pasting logic.

So far, almost every code block we've written has been a linear script: do a thing, then do another thing. That works for small, one-off analyses. But the moment you find yourself writing the same five lines for the fifth dataset, you've hit the wall that functions were invented to break through.

A function is just a named, reusable recipe. You package up some logic, give it a name, declare what it needs, and from then on you call it by name. Your analysis script reads like a story ("clean, transform, summarize, plot") instead of like the inside of a kitchen ("chop, chop, chop, stir, stir, chop, stir, …").

Anatomy of a function

A function in R has four parts:

A name (the variable you assign it to).
A list of parameters (the names the function uses internally for its inputs).
A body (the R code that runs).
A return value (by default, the value of the last expression — or whatever you wrap in return()).

The simplest possible function:
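A minimal sketch of such a function. The name square() comes from the note that follows; the parameter name x is an assumption made for illustration.

```r
# Define a function named square with a single parameter, x.
square <- function(x) {
  x * x  # the last expression is the return value
}

square(4)           # 16
square(c(1, 2, 3))  # 1 4 9
```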

Note: because * is vectorized, our square() function also works on whole vectors. We didn't have to write a loop.

Defaults and named arguments

You can give parameters default values, and callers can specify arguments by name in any order:
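As an illustration, here is a small sketch with one default value. The function power() and its parameter names are hypothetical, chosen only to show the calling styles.

```r
# Hypothetical example: raise x to a power, squaring by default.
power <- function(x, exponent = 2) {
  x^exponent
}

power(3)                    # uses the default exponent: 9
power(3, exponent = 3)      # overrides the default: 27
power(exponent = 3, x = 2)  # named arguments in any order: 8
```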

Defaults are wonderful: they make the common case easy while keeping the uncommon case possible.

A useful, real-world function

Let's write something concrete: a function that takes a numeric vector and returns a small named list summarizing it.
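One possible version is sketched here. The name describe_vector(), the choice of summary statistics, and the na.rm default are all assumptions, not a fixed recipe.

```r
# Hypothetical helper: summarize a numeric vector as a named list.
describe_vector <- function(x, na.rm = TRUE) {
  list(
    n         = length(x),                 # total number of values
    n_missing = sum(is.na(x)),             # how many are NA
    mean      = mean(x, na.rm = na.rm),
    sd        = sd(x, na.rm = na.rm),
    median    = median(x, na.rm = na.rm),
    min       = min(x, na.rm = na.rm),
    max       = max(x, na.rm = na.rm)
  )
}

s <- describe_vector(c(2, 4, 6, NA))
s$mean       # 4
s$n_missing  # 1
```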

That single function replaces ~7 lines you'd otherwise rewrite every time you wanted a summary. The benefits compound: if you later want to add quantile() or change the NA handling, you edit one place.

The DRY principle: Don't Repeat Yourself

Look at this code. What's the pattern?

mean_mpg  <- mean(mtcars$mpg)
mean_hp   <- mean(mtcars$hp)
mean_wt   <- mean(mtcars$wt)
mean_qsec <- mean(mtcars$qsec)

Four nearly identical lines. The moment you have to compute one more mean, you copy-paste again. That's a smell. A function removes the repetition:
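One way to do this, as a sketch: the helper column_means() is a hypothetical name, and it assumes every requested column is numeric.

```r
# Hypothetical helper: compute the mean of several named columns at once.
column_means <- function(data, columns) {
  # sapply() calls mean() on each selected column and
  # returns a named numeric vector.
  sapply(data[columns], mean)
}

means <- column_means(mtcars, c("mpg", "hp", "wt", "qsec"))
means["mpg"]  # same value as mean(mtcars$mpg)
```

Adding a fifth column now means adding one name to the vector rather than one more copied line. For this particular case, base R's colMeans() already does the job.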

Rule of thumb: if you find yourself typing the same shape of code three or more times with small variations, lift it into a function (or find an existing one; chances are someone already wrote it).

Functions for transformations

Functions shine in data transformation. Here's one that standardizes a numeric vector (subtract mean, divide by SD — useful for comparing variables on different scales):
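A minimal sketch of that transformation. The name standardize() and the na.rm parameter are assumptions for illustration.

```r
# Convert a numeric vector to z-scores: (value - mean) / SD.
standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

standardize(c(10, 20, 30))  # -1 0 1
```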

You can now apply it to many columns:
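For instance, using the built-in mtcars data, a sketch might look like this. The column selection and the variable name scaled are assumptions.

```r
# A compact standardize() is repeated here so the snippet runs on its own.
standardize <- function(x) (x - mean(x)) / sd(x)

cols <- c("mpg", "hp", "wt")
scaled <- mtcars  # work on a copy, leaving mtcars untouched
scaled[cols] <- lapply(mtcars[cols], standardize)

colMeans(scaled[cols])    # each approximately 0
sapply(scaled[cols], sd)  # each 1
```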

Without the function, you'd repeat the standardization expression once per column.

Function design tips

Name verbs. Functions do things: compute_*, clean_*, summarize_*, plot_*. Avoid f1, helper, do_stuff.
One job per function. A function called load_and_clean() is a code smell — split it into load_data() and clean_data().
Keep it small. If a function won't fit on a screen, it's probably trying to do too much.
Predictable inputs and outputs. Document what the function expects and returns (even if just a comment).
Pure functions are easier to reason about. A function that only depends on its inputs (no global state) is much easier to test and reuse.

Pure vs. side-effecting

A "pure" function returns a value and changes nothing else. A "side-effecting" function does things to the outside world (prints, plots, writes a file, modifies a global). Both are useful — but mixing them inside one function makes code hard to reason about:

# Mixed concerns — hard to reuse
analyze_and_print <- function(x) {
  result <- mean(x) / sd(x)
  cat("Result is:", result, "\n")  # side effect
  result
}

Better to separate:
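A sketch of one way to split the two concerns. The names mean_to_sd_ratio() and print_result() are hypothetical.

```r
# Pure: computes and returns a value, and does nothing else.
mean_to_sd_ratio <- function(x) {
  mean(x) / sd(x)
}

# Side-effecting: prints the value, then hands it back invisibly.
print_result <- function(result) {
  cat("Result is:", result, "\n")
  invisible(result)
}

ratio <- mean_to_sd_ratio(c(2, 4, 6))  # ratio is 2; nothing is printed
print_result(ratio)                    # prints: Result is: 2
```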

invisible() returns a value without auto-printing — useful when the caller might or might not want it.

Scope, briefly

Variables created inside a function live only inside it. They don't leak out to your global environment:
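A small sketch of this behavior. The names add_one() and local_total are hypothetical, and the example assumes no variable called local_total already exists in your session.

```r
add_one <- function(x) {
  local_total <- x + 1  # exists only while add_one() is running
  local_total
}

add_one(5)             # 6
exists("local_total")  # FALSE: the variable did not leak out
```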

This isolation is a feature: it prevents functions from silently breaking each other. Variables you want to share across the analysis live in the global environment; everything else stays local.

Test your understanding

Question (select one)

A function in R returns:

Nothing — only return() returns.

The value of its last expression by default, or an explicit return() value if you use one.

The first expression.

A list of all its expressions.

Question (select one)

Why is the DRY ("Don't Repeat Yourself") principle especially valuable in analysis code?

It saves typing.

Every duplication is a place a bug can hide and a place you'll have to remember to update — lifting repeated logic into a function gives you one place to fix it and one place to test.

Functions run faster than inline code.

It makes scripts shorter.

Question (select one)

Which is the best name for a function that computes a standardized z-score from a numeric vector?

f

do_thing

standardize

x_value

Mini challenge: write a `summarize_group` function

Write a function called summarize_group(x) that takes a numeric vector and returns a named list with elements n, mean, and sd. Use na.rm = TRUE for both the mean and the SD.

Define summarize_group <- function(x) { ... } so that calling it on a numeric vector returns a list with elements n (length), mean (mean, ignoring NAs), and sd (sd, ignoring NAs).

Functions are the building block of every reusable analysis. On the next page, we'll zoom out one more level to scripts and projects: how to organize a real analysis on disk so that future you (and your collaborators) can actually run it.
