
Relationships Between Variables

Scatterplots, correlations, and cross-tabulations — the toolkit for asking "does X have anything to do with Y?" and (importantly) interpreting the answer carefully.

So far we've explored columns one at a time. But most interesting questions are relational: does increasing X tend to go with increasing Y? Are some categories overrepresented in some groups? Is what we're seeing a real signal, or just noise?

This page introduces the three core tools:

  • Scatterplots for two numeric variables
  • Correlation for measuring how strongly two numeric variables move together
  • Cross-tabulation for two categorical variables

And it ends with a word of caution: correlation is not causation.

Scatterplots

The simplest, most informative two-variable plot: one dot per observation, positioned at its (x, y) values.
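As a minimal sketch, assuming base R graphics and the built-in mtcars data set with weight on the x-axis and mileage on the y-axis (the page's own plotting code may differ, and `trend` is only an illustrative name), a scatterplot can be drawn like this:

```r
# Scatterplot of car weight against fuel economy, using base R graphics.
# mtcars ships with R; wt is weight in 1000 lbs, mpg is miles per gallon.
plot(
  mtcars$wt, mtcars$mpg,
  xlab = "Weight (1000 lbs)",
  ylab = "Miles per gallon",
  main = "mtcars: weight vs. mileage"
)

# Optional: overlay a least-squares line to make the direction easier to see.
# `trend` is an illustrative name, not something defined elsewhere on this page.
trend <- lm(mpg ~ wt, data = mtcars)
abline(trend, lty = 2)
```

The dashed line is only a visual aid for judging direction; the dots themselves show how tightly the points cluster around it.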


Two things to look for in a scatterplot:

  1. Direction — do the dots tend to rise (positive relationship) or fall (negative) from left to right?
  2. Strength — do the dots hug a clear line, or are they a diffuse cloud?

For mtcars: heavier cars get worse mileage, and the relationship is very tight. (This is unsurprising, but it's reassuring when the data agrees with intuition.)

Correlation

The Pearson correlation measures linear association on a scale from -1 to +1:

  • +1: perfect positive linear relationship
  • 0: no linear relationship
  • -1: perfect negative linear relationship
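
In R, the Pearson correlation comes from cor(). A minimal sketch, assuming the weight and mileage columns of mtcars are the pair of interest (`r_wt_mpg` is just an illustrative name):

```r
# Pearson correlation between weight and mileage.
# method = "pearson" is the default for cor(), so it is not spelled out here.
r_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
print(round(r_wt_mpg, 2))  # about -0.87: strong and negative
```

The sign gives the direction and the magnitude gives the strength, so this value matches a tight downward-sloping cloud of points.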

For a quick overview, cor() accepts a whole data frame and returns a matrix:
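
A short sketch of that overview, assuming mtcars as the data frame (`cor_all` is an illustrative name). Rounding is only there to make the printout readable:

```r
# cor() on an all-numeric data frame returns a square matrix
# with one row and one column per variable.
cor_all <- cor(mtcars)
print(round(cor_all, 2))
```

Every column of mtcars is numeric, so this works directly. A data frame containing factor or character columns would need those dropped first, because cor() only accepts numeric input.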


A few rules to remember about correlation:

  • It only measures linear association. A perfect curve can have correlation near zero.
  • It is sensitive to outliers. A single far-away point can swing it dramatically.
  • Like the mean, it can be misleading. Plot, then correlate, never the other way around.
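
The outlier rule can be illustrated with a small sketch. The values below are made up for illustration, and all variable names are hypothetical:

```r
# Ten points with a clear upward trend (made-up values for illustration).
x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)
r_clean <- cor(x, y)

# Add one far-away point and recompute.
x_out <- c(x, 30)
y_out <- c(y, -20)
r_outlier <- cor(x_out, y_out)

print(round(c(clean = r_clean, with_outlier = r_outlier), 2))
```

The first value is strongly positive and the second is negative, even though ten of the eleven points are unchanged. A scatterplot would reveal the stray point immediately.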

A glance at the four "famous" patterns

Two variables can be strongly related (through a curve, for example, or a clear separation into groups) while their correlation is close to zero. That's why scatterplots are non-negotiable.
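
One such pattern can be reproduced with a few lines. This is a constructed example rather than the page's original figure, and `x_curve`, `y_curve`, and `r_curve` are illustrative names:

```r
# A perfectly deterministic U-shaped relationship: y is fully determined by x.
x_curve <- -10:10
y_curve <- x_curve^2

# The downward half and the upward half cancel out in the linear measure.
r_curve <- cor(x_curve, y_curve)
print(r_curve)  # essentially zero

# The plot shows the structure that the single number hides.
plot(x_curve, y_curve, main = "Perfect curve, zero correlation")
```

Knowing x tells you y exactly here, yet the correlation reports no linear association at all.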

Categorical × categorical: cross-tabulation

When both variables are categorical, you don't have positions — you have counts of combinations. table() does this in base R:
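
A minimal sketch, assuming the cyl and am columns of mtcars are the ones being tabulated (`counts` and `row_props` are illustrative names). The first table holds raw counts; the second converts each row to proportions with prop.table():

```r
# Counts of each cylinder / transmission combination.
# In mtcars, am is coded 0 = automatic, 1 = manual.
counts <- table(cyl = mtcars$cyl, am = mtcars$am)
print(counts)

# Row proportions: within each cylinder count, the share of each
# transmission type. margin = 1 means "normalize across each row".
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

Using margin = 2 instead would normalize by column, answering the reverse question: of all manual cars, what fraction had 4 cylinders?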


Read the second table as: "of all 4-cylinder cars, what fraction had a manual transmission?" You can already see (in mtcars) that cars with fewer cylinders tend to be manual, while those with more cylinders tend to be automatic.

Categorical × numeric: boxplots and grouped summaries

When one variable is categorical and one is numeric, a boxplot per group is hard to beat — and grouped summaries put numbers on what you see:
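
A minimal sketch in base R, assuming the built-in iris data set. The species names discussed in this section come from iris, but which measurement was plotted is not stated, so Petal.Length is an assumption, as are the names `petal_means` and `petal_medians`:

```r
# One box per species: the distribution of petal length within each group.
boxplot(Petal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Petal length (cm)")

# Grouped summaries: mean and median petal length per species.
petal_means <- tapply(iris$Petal.Length, iris$Species, mean)
petal_medians <- tapply(iris$Petal.Length, iris$Species, median)
print(rbind(mean = petal_means, median = petal_medians))
```

tapply() splits the numeric column by the grouping column and applies the function to each piece, which is the numeric counterpart of drawing one box per group.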


Setosa is clearly different from the other two. Versicolor and virginica overlap but differ. The picture and the table tell the same story.

The most important warning: correlation ≠ causation

Two variables can be correlated without one causing the other. Possibilities:

  1. X causes Y (the interesting case)
  2. Y causes X (reverse causation)
  3. A third thing Z causes both (confounding)
  4. Pure coincidence (especially in noisy or short data)

Ice cream sales and drowning deaths are correlated. Ice cream does not cause drowning. Summer causes both.
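
The confounding pattern can be imitated with simulated data. Everything below is made up for illustration (the coefficients, noise levels, and variable names are assumptions, not real sales or drowning figures): temperature drives both quantities, and neither affects the other.

```r
# Simulated (not real) data: temperature drives both quantities independently.
set.seed(42)
n <- 1000
temp <- runif(n, min = 0, max = 35)                 # daily temperature, made up
ice_cream <- 50 + 10 * temp + rnorm(n, sd = 40)     # sales rise with heat
drownings <- 1 + 0.2 * temp + rnorm(n, sd = 1.5)    # toy incident index, rises with heat

# The raw correlation is clearly positive even though neither causes the other.
r_raw <- cor(ice_cream, drownings)

# Remove the part of each variable explained by temperature, then correlate
# what is left over (a partial correlation computed via residuals).
ice_resid <- resid(lm(ice_cream ~ temp))
drown_resid <- resid(lm(drownings ~ temp))
r_given_temp <- cor(ice_resid, drown_resid)

print(round(c(raw = r_raw, given_temp = r_given_temp), 2))
```

Once temperature is accounted for, the leftover association is close to zero, which is what you would expect when a third variable is responsible for the original correlation.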

Establishing causation requires either a randomized experiment (usually unavailable when all you have is observational data) or careful causal reasoning beyond what summary statistics alone can provide.

Beginners often jump from "X and Y are correlated" to "X causes Y." Don't. The correlation is a clue, not a conclusion.

A tiny case study: do bigger engines really use more fuel?


Correlation around -0.85. Clearly negative. But — is engine displacement causing the worse mileage, or is it that bigger cars tend to have both bigger engines and more weight, and weight is the real driver? Plot wt vs disp and you'll see how tightly those two go together. We can't separate them just by looking at the correlation matrix.
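
The numbers behind this case study can be checked with a short sketch. It assumes the mpg, disp, and wt columns of mtcars, and `case_cors` is an illustrative name:

```r
# Pairwise correlations among mileage, displacement, and weight.
case_cors <- round(cor(mtcars[, c("mpg", "disp", "wt")]), 2)
print(case_cors)

# The plot suggested in the text: weight against displacement.
plot(mtcars$wt, mtcars$disp,
     xlab = "Weight (1000 lbs)", ylab = "Displacement (cu. in.)")
```

Displacement and weight are each strongly negatively correlated with mpg, and they are strongly positively correlated with each other (about 0.89), so the matrix alone cannot say which of them is doing the work.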

This is confounding, and it's why the next steps in any real analysis usually involve multivariate methods (regression, modeling) — which are beyond this course but worth knowing as the natural next step.

Test your understanding

Question (select one)

The correlation between two variables is computed as -0.92. What does that imply?

They are unrelated.

One causes the other.

They have a strong negative linear relationship — when one tends to go up, the other tends to go down.

One of the variables has more NAs than the other.

Question (select one)

Two variables have correlation 0. Which conclusion is always justified?

They are completely independent.

They have no relationship of any kind.

They have no linear relationship. They might still have a curved or otherwise structured relationship — only a plot will tell you.

The dataset is too small.

Question (select one)

Ice cream sales and drowning rates are strongly correlated. The most likely explanation is:

Ice cream causes drowning.

Drowning causes ice cream sales (e.g., grieving families binge).

A third variable — summer/hot weather — causes both. This is a confounding variable.

Pure coincidence.

Mini challenge: build a correlation table

For mtcars, compute the correlation matrix of just mpg, hp, wt, and qsec, rounded to 2 decimals. Assign it to cor_mat.

A small correlation matrix

Build cor_mat — a 4×4 numeric matrix of correlations among the columns mpg, hp, wt, and qsec, rounded to 2 decimal places.

We've now seen the analytical workhorses. The next section is about making your findings visible and persuasive — the art of data visualization.
