Relationships Between Variables
Scatterplots, correlations, and cross-tabulations — the toolkit for asking "does X have anything to do with Y?" and (importantly) interpreting the answer carefully.
So far we've explored columns one at a time. But most interesting questions are relational: does increasing X tend to go with increasing Y? Are some categories overrepresented in some groups? Is what we're seeing a real signal, or just noise?
This page introduces the three core tools:
- Scatterplots for two numeric variables
- Correlation for measuring how strongly two numeric variables move together
- Cross-tabulation for two categorical variables
And it ends with a caution: correlation is not causation.
Scatterplots
The simplest, most informative two-variable plot: one dot per observation, position by (x, y).
Two things to look for in a scatterplot:
- Direction — do the dots tend to rise (positive relationship) or fall (negative) from left to right?
- Strength — do the dots hug a clear line, or are they a diffuse cloud?
For mtcars: heavier cars get worse mileage, and the relationship is very tight. (This is unsurprising, but it's reassuring when the data agrees with intuition.)
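The mtcars pattern described above can be reproduced with base R's plot(). This is a minimal sketch; the axis labels, title, and point style are illustrative choices, not taken from the original page.

```r
# Scatterplot of weight vs. fuel economy in the built-in mtcars data.
# wt is weight in units of 1000 lbs; mpg is miles per gallon.
plot(
  mtcars$wt, mtcars$mpg,
  xlab = "Weight (1000 lbs)",
  ylab = "Miles per gallon",
  main = "Heavier cars get worse mileage",
  pch  = 19  # solid dots, one per car
)
```

The dots fall from left to right (negative direction) and stay fairly close to a line (strong relationship).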
Correlation: one number for "how related?"
The Pearson correlation measures linear association on a scale from -1 to +1:
- +1: perfect positive linear relationship
- 0: no linear relationship
- -1: perfect negative linear relationship
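To make the scale concrete, here is the Pearson correlation for the weight and mileage columns of mtcars, computed with base R's cor(). The variable name r_wt_mpg is only for illustration.

```r
# Pearson is the default method of cor().
r_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
round(r_wt_mpg, 2)  # about -0.87: strong and negative
```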
For a quick overview, cor() accepts a whole data frame of numeric columns and returns a correlation matrix.
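A minimal sketch of the matrix form, using the built-in mtcars and iris data. The names cor_all and iris_num are hypothetical, chosen for this example only.

```r
# Every column of mtcars is numeric, so the whole data frame can be passed.
cor_all <- round(cor(mtcars), 2)
cor_all[1:4, 1:4]  # show just the top-left corner

# cor() needs numeric input. iris has a factor column (Species),
# so keep only the numeric columns first.
iris_num <- iris[sapply(iris, is.numeric)]
round(cor(iris_num), 2)
```

The diagonal is always 1 (each variable correlates perfectly with itself), and the matrix is symmetric, so only one triangle needs reading.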
A few rules to remember about correlation:
- It only measures linear association. A perfect curve can have correlation near zero.
- It is sensitive to outliers. A single far-away point can swing it dramatically.
- Like the mean, it can be misleading. Plot, then correlate, never the other way around.
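The outlier rule is easy to see with a small made-up example. The numbers below are invented for illustration and are not from any dataset in this course.

```r
# Eight points with a weak, slightly negative relationship.
x <- 1:8
y <- c(2, 1, 2, 1, 2, 1, 2, 1)
r_before <- cor(x, y)
round(r_before, 2)  # about -0.22

# Add one far-away point and recompute.
x_out <- c(x, 30)
y_out <- c(y, 30)
r_after <- cor(x_out, y_out)
round(r_after, 2)   # about 0.96
```

A single added point flips the sign and turns a weak correlation into a very strong one, which a scatterplot would have made obvious at a glance.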
A glance at the four "famous" patterns
Two variables can be highly related (through a curve, for example, or a clear separation into groups) while their correlation is close to zero. That's why scatterplots are non-negotiable.
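As a sketch of the curved case, take a made-up example where y is completely determined by x, yet the Pearson correlation is essentially zero because the relationship is symmetric rather than linear.

```r
# y depends on x exactly, but through a U-shaped curve.
x <- -5:5
y <- x^2

cor(x, y)             # essentially 0
plot(x, y, pch = 19)  # the plot shows the structure the number misses
```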
Categorical × categorical: cross-tabulation
When both variables are categorical, you don't have positions; you have counts of combinations. table() does this in base R.
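A minimal sketch with mtcars, crossing cylinder count with transmission type. The second table uses prop.table() with margin = 1 to turn counts into row proportions. The names tab and row_props are illustrative.

```r
# Counts of each cylinder / transmission combination.
# In mtcars, am is coded 0 = automatic, 1 = manual.
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
tab

# Row proportions: within each cylinder count, the share of each transmission.
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```

With margin = 1 each row sums to 1, so every row answers a question about one cylinder group.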
Read the second table as: "of all 4-cylinder cars, what fraction were manual transmission?" You can already see (in mtcars) that cars with fewer cylinders tend to be manual, while cars with more cylinders tend to be automatic.
Categorical × numeric: boxplots and grouped summaries
When one variable is categorical and one is numeric, a boxplot per group is hard to beat — and grouped summaries put numbers on what you see:
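Here is one way to do both with the built-in iris data. The choice of Petal.Length as the numeric variable is an assumption made for this sketch, as is the name petal_means.

```r
# One box per species, using the formula interface: numeric ~ group.
boxplot(Petal.Length ~ Species, data = iris,
        ylab = "Petal length (cm)")

# Grouped summary: mean petal length per species.
petal_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
petal_means
```

Swapping FUN = mean for FUN = median gives a summary that matches the line drawn inside each box.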
Setosa is clearly different from the other two. Versicolor and virginica overlap but differ. The picture and the table tell the same story.
The most important warning: correlation ≠ causation
Two variables can be correlated without one causing the other. Possibilities:
- X causes Y (the interesting case)
- Y causes X (reverse causation)
- A third thing Z causes both (confounding)
- Pure coincidence (especially in noisy or short data)
Ice cream sales and drowning deaths are correlated. Ice cream does not cause drowning. Summer causes both.
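Confounding can be imitated with simulated data. In the sketch below every number is invented: temperature stands in for Z, and the two other variables are each generated from temperature plus independent noise. Neither one appears in the formula for the other.

```r
set.seed(1)  # fixed only so the example is reproducible
n <- 500

temperature <- rnorm(n)                      # the confounder Z
ice_cream   <- 2.0 * temperature + rnorm(n)  # driven by Z plus noise
drownings   <- 1.5 * temperature + rnorm(n)  # also driven by Z plus noise

# Strongly positive, even though neither variable was built from the other.
r_raw <- cor(ice_cream, drownings)
round(r_raw, 2)
```

The correlation is real, but by construction it comes entirely from the shared cause.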
Establishing causation requires either a randomized experiment (usually not available when all you have is observational data) or careful causal reasoning beyond what summary statistics alone can do.
Beginners often jump from "X and Y are correlated" to "X causes Y." Don't. The correlation is a clue, not a conclusion.
A tiny case study: do bigger engines really use more fuel?
The correlation between disp and mpg is around -0.85: clearly negative. But is engine displacement causing the worse mileage, or is it that bigger cars tend to have both bigger engines and more weight, and weight is the real driver? Plot wt vs disp and you'll see how tightly those two go together. We can't separate them just by looking at the correlation matrix.
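The numbers behind this case study can be checked directly in mtcars. The variable names r_disp_mpg and r_wt_disp are illustrative.

```r
# Displacement vs. mileage: the headline correlation.
r_disp_mpg <- cor(mtcars$disp, mtcars$mpg)

# Weight vs. displacement: the two candidate causes are themselves entangled.
r_wt_disp <- cor(mtcars$wt, mtcars$disp)

round(c(disp_mpg = r_disp_mpg, wt_disp = r_wt_disp), 2)  # about -0.85 and 0.89

plot(mtcars$wt, mtcars$disp,
     xlab = "Weight (1000 lbs)",
     ylab = "Displacement (cu. in.)",
     pch  = 19)
```

Because weight and displacement rise together so closely, either one could account for the drop in mileage, and the pairwise correlations cannot say which.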
This is confounding, and it's why the next steps in any real analysis usually involve multivariate methods (regression, modeling) — which are beyond this course but worth knowing as the natural next step.
Test your understanding
The correlation between two variables is computed as -0.92. What does that imply?
- They are unrelated.
- One causes the other.
- They have a strong negative linear relationship: when one tends to go up, the other tends to go down.
- One of the variables has more NAs than the other.
Two variables have correlation 0. Which conclusion is always justified?
- They are completely independent.
- They have no relationship of any kind.
- They have no linear relationship. They might still have a curved or otherwise structured relationship, and only a plot will tell you.
- The dataset is too small.
Ice cream sales and drowning rates are strongly correlated. The most likely explanation is:
- Ice cream causes drowning.
- Drowning causes ice cream sales (e.g., grieving families binge).
- A third variable (summer/hot weather) causes both. This is a confounding variable.
- Pure coincidence.
Mini challenge: build a correlation table
For mtcars, compute the correlation matrix of just mpg, hp, wt, and qsec, rounded to 2 decimal places, and assign it to cor_mat. The result should be a 4×4 numeric matrix.
We've now seen the analytical workhorses. The next section is about making your findings visible and persuasive — the art of data visualization.