Thinking in Data
Before writing a line of R, what does it mean to "think computationally" about data? An introduction to the mental shift from doing arithmetic by hand to instructing a computer to do it for you.
This is the most important page in the course.
If you skim everything else and only read one page carefully, make it this one — because the mental shift it describes is the difference between using R well and fighting it forever.
A different way of thinking
When you do arithmetic by hand, you think one number at a time. "What is 5 times 7? Thirty-five. Write that down. Next: what is 8 times 9? Seventy-two. Write that down."
When you do arithmetic on a computer, you think one collection at a time. "I have a column of widths and a column of heights — I want a column of areas. Compute it."
That subtle shift — from number at a time to collection at a time — is the entire essence of computational thinking for data analysis.
Both approaches give the same numeric answer. But the computational approach scales. Three rows? Three hundred? Three million? The expression is the same.
A first taste
Watch how this looks in R. We will compute the area of three rectangles.
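A minimal sketch of that computation is shown here. The names widths, heights, and areas are the ones this page uses; the measurements themselves are made up for illustration.

```r
# Illustrative measurements (assumed values, chosen only for this sketch)
widths  <- c(2, 3, 5)
heights <- c(4, 6, 1)

# One expression multiplies the two vectors element by element
areas <- widths * heights
areas  # elementwise products: 8, 18, 5
```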
Three things happened here, and all three are worth pausing on:
- We named two collections of numbers (widths and heights) instead of writing them out repeatedly.
- We applied an operation to entire collections at once (widths * heights).
- We named the result (areas) so we can use it again.
That is the basic loop of computational thinking for data analysis: name things, operate on whole collections, name the result. Every R analysis in this course is some elaboration of that loop.
Why this matters
To see why, imagine you suddenly had 1,000 rectangles instead of 3.
In the hand-arithmetic way of thinking, you would need to do 1,000 multiplications. With a spreadsheet, you could drag a formula down 1,000 rows. With R, your code does not change at all; only the input does:
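As a sketch, here is the computation with 1,000 rectangles. The measurements are randomly generated stand-ins (an assumption for illustration, not course data); only the inputs change.

```r
# 1,000 made-up rectangles; set.seed() makes the random draw repeatable
set.seed(1)
widths  <- runif(1000, min = 1, max = 10)
heights <- runif(1000, min = 1, max = 10)

# The expression does not depend on how many rectangles there are
areas <- widths * heights
length(areas)  # 1000
```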
The line areas <- widths * heights is byte-for-byte identical to the one in the previous example. Three rectangles or a thousand: the computer does the work either way. The code describes the operation, not the count.
This is huge. It means the same code you write to explore a tiny sample dataset will work on the full production dataset. It means your mental model does not have to scale just because your data does.
"Computational thinking," in plainer words
The phrase computational thinking is sometimes made to sound fancy. For our purposes it just means cultivating a few habits:
- Think in collections, not in individual values. Whenever you feel the urge to write "for each row…", first ask: "is there a way to express this as one operation on the whole column?"
- Name things meaningfully. A variable called x is a guess; a variable called pediatric_visits_2024 is a sentence. Future you will thank present you.
- Build up from simple steps. Big analyses are not written all at once. They are stacks of tiny steps, each one of which can be inspected and verified.
- Trust the computer for the tedious parts, but check the important parts. Computers do not get tired or make transcription errors — but they will happily compute the wrong thing if you ask them to.
- Treat data as something to interrogate, not just report on. A dataset is not the answer to a question; it is a thing you investigate to develop an answer.
These habits are not specific to R. Anyone working with data — in Python, in SQL, in a spreadsheet, even with paper — benefits from them. But R is especially well-suited to them, because the language is shaped around the idea that collections are the basic unit of work.
Patterns we will see again and again
Three patterns underlie almost every R analysis we will write later in this course. Recognizing them now will make all the specific syntax easier to absorb.
Pattern 1: Build a collection, operate on it
heights_cm <- c(168, 172, 165, 180)
heights_m <- heights_cm / 100
Notice: one operation, applied to every element.
Pattern 2: Combine two collections elementwise
height <- c(1.68, 1.72, 1.65, 1.80)
weight <- c(63, 70, 58, 85)
bmi <- weight / height^2
Two collections in, one collection out.
Pattern 3: Summarize a collection
heights <- c(168, 172, 165, 180)
mean(heights) # one number out of many
median(heights)
sd(heights)
This is the "many in, one out" direction, the basic gesture of statistics.
Almost every analysis you will write is a stacked combination of those three moves.
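For instance, standardizing a set of heights stacks Pattern 3 inside Pattern 1: two summaries feed one whole-collection operation. This is a small sketch that reuses the heights from Pattern 3; the name z_scores is introduced here purely for illustration.

```r
heights <- c(168, 172, 165, 180)

# Two summaries (many in, one out) feed one operation on the whole collection
z_scores <- (heights - mean(heights)) / sd(heights)

# Summarizing again: standardized values average to zero, up to rounding error
mean(z_scores)
```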
A small worked example
Suppose we have monthly sales for a year and we want to ask: in which months did sales beat the annual average?
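A sketch of that analysis follows. The names sales, avg_sales, above_avg, and months are the ones this page uses; the sales figures are invented for illustration, and month.abb is R's built-in vector of month abbreviations.

```r
sales     <- c(120, 135, 150, 110, 160, 175, 180, 170, 140, 130, 155, 190)  # made-up monthly figures
months    <- month.abb            # "Jan" ... "Dec", built into R
avg_sales <- mean(sales)          # one number summarizing the year
above_avg <- sales > avg_sales    # TRUE or FALSE for each month
months[above_avg]                 # keep the month names where the flag is TRUE
```

With these made-up figures the annual mean is 151.25, so the selected months are May, Jun, Jul, Aug, Nov, and Dec.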
That five-line analysis used all three patterns: a summary (mean), a comparison applied to a whole collection (sales > avg_sales), and a selection that pairs two collections position by position (months[above_avg]). The code reads almost like English: take the mean, mark months above it, pick those names.
If next year you have 24 months of data, you change nothing except the inputs.
Why loops are usually unnecessary in R
People coming from languages like C, Java, or Python often expect data analysis code to look like this:
# Not idiomatic R
result <- numeric(length(sales))
for (i in seq_along(sales)) {
result[i] <- sales[i] - mean(sales)
}
In R, you almost never have to write that. The same idea is one line and is dramatically clearer:
result <- sales - mean(sales)
R interprets sales - mean(sales) as "subtract the mean from each element of sales," giving you a vector of deviations. This is called vectorization, and we will dedicate an entire page to it soon. For now, just absorb the cultural rule: if you find yourself writing a for loop, look hard for a vectorized expression first.
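To check that the two versions really agree, the sketch below runs both on a small made-up sales vector and compares the results. The names loop_result and vec_result are introduced here for illustration.

```r
sales <- c(120, 135, 150, 110, 160, 175)  # made-up figures

# Loop version: fill a preallocated vector one element at a time
loop_result <- numeric(length(sales))
for (i in seq_along(sales)) {
  loop_result[i] <- sales[i] - mean(sales)
}

# Vectorized version: one expression over the whole vector
vec_result <- sales - mean(sales)

all.equal(loop_result, vec_result)  # TRUE
```

The loop also recomputes mean(sales) on every pass, whereas the vectorized expression computes it once.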
Test your understanding
Which statement best captures the shift from hand arithmetic to computational thinking?
- "Computers are faster, but the way you think about a problem stays exactly the same."
- "Computers think in numbers; humans think in patterns."
- "Hand arithmetic operates one number at a time; computational thinking operates on whole collections at a time."
- "Computational thinking means avoiding statistics."
Which of the following best describes what happens when you write widths * heights in R, where each is a vector of length 3?
- R multiplies the lengths together to get 9.
- R multiplies the two vectors elementwise and returns a new vector of length 3.
- R picks the first element of each and multiplies those.
- R raises an error because you can't multiply two vectors.
In R, why is writing a for loop over a column of data often considered non-idiomatic?
- Because R does not support loops.
- Because loops always crash.
- Because most operations can be expressed more clearly and concisely with vectorized expressions on whole columns.
- Because loops are illegal in statistical analysis.
A small challenge
Given the vector temps_f of temperatures in Fahrenheit (already loaded), create a vector temps_c of the same length, where each element is the Celsius equivalent of the corresponding Fahrenheit value. The formula is C = (F - 32) * 5 / 9. Use a single vectorized expression: no loops, no sapply.
On the next page we will finally run our first R program from scratch and get comfortable running code in WebR.