Regression Metrics — MAE, MSE, RMSE, R²
A model predicts numbers — but how wrong is it, and does "wrong" mean a few big misses or many small ones? Choosing and reading the right error metric is a skill in itself.
A regression model outputs numbers, so evaluating it means measuring the gap between predicted numbers and true numbers. That sounds simple, but "how wrong is the model?" has several different honest answers, and they can disagree about which of two models is better. This chapter is about understanding each metric deeply enough to choose the right one — and to know what it is quietly not telling you.
Residuals: the raw material
Every regression metric is built from residuals: the differences between actual and predicted values (actual minus predicted), one per example.
A residual of +3 means the model undershot by 3; -3 means it overshot by 3. A perfect model has all residuals equal to zero. The metrics below are just different ways to boil a whole vector of residuals down to a single score, and the way you summarize them encodes what you care about.
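As a minimal sketch of that sign convention, the snippet below computes residuals with NumPy. The arrays are toy values made up for illustration, not data from this chapter.

```python
import numpy as np

# Toy values, made up for illustration.
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([7.0, 12.0, 18.0, 19.0])

# Residual = actual - predicted, one per example.
residuals = y_true - y_pred

print(residuals.tolist())  # [3.0, 0.0, -3.0, 1.0]
```

The first prediction undershot by 3 (residual +3), the third overshot by 3 (residual -3).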
MAE — Mean Absolute Error
What it measures. The average size of the errors, ignoring their direction: take the absolute value of each residual and average them.
- Units: the same as the target. If you are predicting house prices in dollars, MAE is in dollars. "On average we are off by about 4,300 dollars" — that is an MAE, and it is wonderfully interpretable.
- What it does not measure: it does not care whether your errors are a few huge misses or many small ones. A model that is off by 10 on every one of 10 houses and a model that is perfect on 9 and off by 100 on one both have a total absolute error of 100, and therefore the same MAE of 10.
- Robust to outliers: because errors are not squared, one giant miss does not dominate. MAE treats a 100-dollar error as exactly ten times a 10-dollar error — no more, no less.
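The ten-house comparison above can be checked directly. This is a small sketch with a hand-written `mae` helper (a hypothetical name, not a library function) and made-up prices; only the size of the errors matters here.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the residuals, ignoring sign."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Ten houses with a made-up true price of 200 each.
y_true = np.full(10, 200.0)

steady = y_true + 10.0      # off by 10 on every house
spiky = y_true.copy()
spiky[-1] += 100.0          # perfect on nine houses, off by 100 on one

print(mae(y_true, steady))  # 10.0
print(mae(y_true, spiky))   # 10.0
```

Two very different error patterns, one identical MAE.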
MSE — Mean Squared Error
What it measures. The average of the squared residuals.
Squaring has two big consequences:
- Large errors are punished disproportionately. An error of 10 contributes 100; an error of 20 contributes 400 — four times as much for twice the miss. MSE hates big misses. If a single catastrophic prediction is much worse for you than several mediocre ones, MSE encodes that preference.
- The units are squared and meaningless. If the target is in dollars, MSE is in "dollars squared," which no one can interpret. You cannot tell a stakeholder "our error is 19 million dollars-squared." This is why MSE is great for optimizing and comparing but poor for reporting.
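To see the disproportionate penalty in numbers, here is a minimal sketch using the two errors from the first bullet (10 and 20). The variable names are illustrative.

```python
import numpy as np

# Two absolute errors: the second miss is twice as large as the first.
errors = np.array([10.0, 20.0])

# Each error's contribution to the squared-error total.
squared = errors ** 2
print(squared.tolist())       # [100.0, 400.0]

# MSE is the mean of those squared contributions.
mse = float(squared.mean())
print(mse)                    # 250.0
```

Doubling the miss quadruples its contribution, so the larger error accounts for 80% of the total.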
Why squared error is everywhere under the hood
LinearRegression literally minimizes MSE: it finds the line with the smallest sum of squared residuals. Squaring is also mathematically convenient (smooth, differentiable). So MSE is the quantity many models optimize, even when you report something more readable.
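One way to convince yourself of this is to fit a least-squares line and then nudge its coefficients. The sketch below uses `np.polyfit` with `deg=1` as a stand-in for LinearRegression (both perform an ordinary least-squares fit); the synthetic data and the nudge sizes are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data: a noisy straight line (made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)

# Ordinary least-squares fit; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)

def line_mse(slope, intercept):
    """MSE of the line slope * x + intercept on the data above."""
    return float(np.mean((y - (slope * x + intercept)) ** 2))

best = line_mse(slope, intercept)

# Nudge the fitted coefficients in a few directions and recompute the MSE.
nudged = [
    line_mse(slope + ds, intercept + di)
    for ds in (-0.1, 0.1)
    for di in (-0.5, 0.5)
]

print(best, min(nudged))
```

Every nudged line has a higher MSE than the fitted one, which is what "minimizes squared error" means in practice.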
RMSE — Root Mean Squared Error
What it measures. The square root of MSE. Taking the root undoes the squaring of the units, so RMSE is back in the target's units — but it keeps MSE's heavy penalty on large errors.
RMSE is the best of both worlds for many problems: interpretable units (like MAE) and sensitivity to big misses (like MSE). A useful fact: RMSE is always greater than or equal to MAE, and the gap between them grows when errors are uneven. If RMSE is much larger than MAE, you have a few outlier predictions doing a lot of damage.
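The gap between RMSE and MAE is easy to see on two sets of absolute errors with the same average size. This is a sketch with made-up error vectors and hand-written helpers (`mae`, `rmse` are illustrative names).

```python
import numpy as np

def mae(errors):
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    return float(np.sqrt(np.mean(np.square(errors))))

even = np.full(10, 10.0)                  # every error is 10
uneven = np.array([0.0] * 9 + [100.0])    # nine perfect, one miss of 100

print(mae(even), rmse(even))      # 10.0 10.0
print(mae(uneven), rmse(uneven))  # MAE is 10.0, RMSE is about 31.6
```

When every error has the same size the two metrics agree; the more uneven the errors, the further RMSE pulls above MAE.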
Seeing how an outlier splits MAE from RMSE
The clearest way to feel the difference is to inject one terrible prediction and watch each metric react.
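Here is one possible version of that experiment, as a sketch on synthetic data: 100 predictions that are each off by 2, then the same predictions with a single one replaced by a miss of 100. The data and the helper names are assumptions made for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# 100 synthetic targets; every prediction is off by +2 or -2.
y_true = np.linspace(100.0, 200.0, 100)
errors = np.where(np.arange(100) % 2 == 0, 2.0, -2.0)
y_pred = y_true + errors

mae_before, rmse_before = mae(y_true, y_pred), rmse(y_true, y_pred)

# Inject one terrible prediction: off by 100 instead of 2.
y_bad = y_pred.copy()
y_bad[0] = y_true[0] + 100.0

mae_after, rmse_after = mae(y_true, y_bad), rmse(y_true, y_bad)

print(f"MAE:  {mae_before:.2f} -> {mae_after:.2f}")    # about 2.00 -> 2.98
print(f"RMSE: {rmse_before:.2f} -> {rmse_after:.2f}")  # about 2.00 -> 10.20
```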
One outlier barely moves MAE but sends RMSE soaring. That is the whole choice in a nutshell: if rare large errors are especially costly (a wildly wrong medical dose, a hugely mispriced trade), prefer RMSE because it screams about them. If all errors hurt in proportion to their size and you do not want a few outliers to dominate the score, prefer MAE.
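The same trade-off can flip which of two models looks better. The sketch below compares two hypothetical models by their absolute errors only; the numbers are made up to show the effect.

```python
import numpy as np

def mae(errors):
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical Model A: always off by 3.
errors_a = np.full(10, 3.0)
# Hypothetical Model B: off by 1 nine times, off by 12 once.
errors_b = np.array([1.0] * 9 + [12.0])

print(f"A: MAE {mae(errors_a):.2f}, RMSE {rmse(errors_a):.2f}")  # A: MAE 3.00, RMSE 3.00
print(f"B: MAE {mae(errors_b):.2f}, RMSE {rmse(errors_b):.2f}")  # B: MAE 2.10, RMSE 3.91
```

MAE ranks B first because its typical error is smaller; RMSE ranks A first because A never misses badly. Neither ranking is wrong; they answer different questions.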
R² — the coefficient of determination
MAE, MSE, and RMSE tell you the error in the target's units, but they cannot tell you whether that error is good. Is an RMSE of 50 impressive? It depends entirely on the scale and spread of the target. R² answers a different, scale-free question: how much better is my model than just predicting the average every time?
R² is defined as 1 - SS_res / SS_tot. The numerator SS_res is your model's sum of squared errors; the denominator SS_tot is the sum of squared errors of a dumb baseline that always predicts the mean. So:
- R² = 1.0 — perfect predictions, zero error.
- R² = 0.0 — your model is no better than always guessing the mean.
- R² < 0 — your model is worse than guessing the mean. Yes, R² can be negative, and on a bad model with a held-out test set it sometimes is.
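All three cases can be reproduced by computing R² by hand. This is a minimal sketch on a five-point toy target; the `r2` helper is written out here for illustration rather than taken from a library.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - (model's squared error) / (mean baseline's squared error)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # baseline's squared error
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = y_true.copy()                    # exactly right every time
mean_only = np.full(5, y_true.mean())      # always predicts the mean
backwards = y_true[::-1]                   # gets the trend exactly wrong

print(r2(y_true, perfect))    # 1.0
print(r2(y_true, mean_only))  # 0.0
print(r2(y_true, backwards))  # -3.0
```

The backwards model has four times the squared error of the mean baseline, so its R² is 1 - 4 = -3.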
R² is NOT accuracy, and NOT a percentage of correct predictions
The single most common R² mistake is reading "R² = 0.85" as "the model is 85% accurate" or "right 85% of the time." It means nothing of the sort. R² is the fraction of the target's variance that the model explains relative to a mean baseline. A model can have R² = 0.85 and still be off by a large, business-critical amount on every single prediction.
What R² does not tell you
- Not the size of the error. Two datasets with very different RMSE can have the same R², because R² is relative to each dataset's own variance. Always report an absolute metric (MAE or RMSE) alongside R².
- Not whether the model is appropriate. A high R² can come from overfitting, from a lurking outlier inflating the variance, or from a nonlinear pattern that the model happens to partly capture. Look at a residual plot, not just the number.
- Not comparable across different datasets. R² depends on how spread out the target is. A "low" R² on an intrinsically noisy problem can represent a better model than a "high" R² on an easy one.
Residual plots: the picture every R² hides
A residual plot — residuals versus predictions — reveals problems that no single number can. For a good linear model, residuals should scatter randomly around zero with no pattern.
If you see a funnel (errors grow with the prediction), a curve (the model missed a nonlinear pattern), or a few points stranded far from the rest (outliers), the metric alone would never have warned you. Always look.
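A residual plot takes only a few lines with matplotlib. The sketch below uses synthetic residuals whose spread grows with the prediction, so the funnel shape is visible; the data, the output file name, and the off-screen "Agg" backend are assumptions for this example.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic predictions and residuals (made up for illustration).
rng = np.random.default_rng(0)
y_pred = np.linspace(10.0, 100.0, 200)
# The noise scale grows with the prediction, which produces a funnel.
residuals = rng.normal(0.0, 0.05 * y_pred)

fig, ax = plt.subplots()
ax.scatter(y_pred, residuals, s=10)
ax.axhline(0.0, color="black", linewidth=1)  # reference line at zero error
ax.set_xlabel("Predicted value")
ax.set_ylabel("Residual (actual - predicted)")
ax.set_title("Residuals vs predictions")
fig.savefig("residual_plot.png")
```

With your own model, replace the synthetic arrays with `y_pred` and `y_test - y_pred`.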
Putting them side by side
| Metric | Units | Punishes big errors extra? | Robust to outliers? | Interpretable alone? |
|---|---|---|---|---|
| MAE | target units | No | Yes | Yes |
| MSE | target units squared | Yes (heavily) | No | No |
| RMSE | target units | Yes | No | Yes |
| R² | none (ratio) | Yes (via squared error) | No | Only vs a mean baseline |
A solid default is to report RMSE (or MAE) for the error size and R² for the context, and to glance at a residual plot before trusting any of them.
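That default report can be assembled with scikit-learn's metric functions. The sketch below wraps them in a hypothetical helper, `summarize_regression`, and runs it on made-up values; the helper name, the unit label, and the numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def summarize_regression(y_true, y_pred, unit="dollars"):
    """Return a one-line summary: error size (RMSE, MAE) plus context (R^2)."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    r2 = r2_score(y_true, y_pred)
    return f"RMSE {rmse:,.0f} {unit} (MAE {mae:,.0f}), R² = {r2:.2f}"

# Toy values, made up for illustration.
y_true = np.array([12000.0, 8000.0, 30000.0, 15000.0, 5000.0])
y_pred = np.array([11000.0, 9500.0, 24000.0, 15500.0, 6000.0])

report = summarize_regression(y_true, y_pred)
print(report)
```

Taking the square root of `mean_squared_error` works on any scikit-learn version; newer releases also provide `root_mean_squared_error` for the same result.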
A practical reporting recipe
"Our model predicts charges with an RMSE of about 4,300 dollars (MAE 3,100), explaining roughly 78% of the variance (R² = 0.78)." That one sentence gives the error size, the outlier sensitivity, and the context — far more honest than any single number.
Common misconceptions
- "Lower MSE is always a better model." Only on the same data. MSE drops as you overfit the training set; compare on held-out data, and remember its units are not interpretable.
- "R² of 0.9 means 90% correct." No — see the callout above. R² is explained variance, not accuracy.
- "A negative R² is a bug." It is a legitimate signal that your model is worse than predicting the mean — usually a sign of severe overfitting or a mismatched model.
- "RMSE and MAE rank models the same way." Usually, but not when outliers are involved. RMSE can prefer a model that avoids big misses while MAE prefers one with a lower typical error. Choose based on what costs you more.
Real-world applications
A delivery-time predictor might optimize MAE because every minute of error annoys a customer equally. A power-grid load forecaster cares enormously about rare large misses (blackouts) and so leans on RMSE. A scientist reporting how well a variable is explained reaches for R². The metric is not a formality — it is a statement about which mistakes you are willing to tolerate.
Your turn
A LinearRegression is already fit on the diabetes training set, and y_test / y_pred are available.
Compute and store:
- mae: the mean absolute error,
- rmse: the root mean squared error (use root_mean_squared_error or mean_squared_error(...) ** 0.5),
- r2: the R² score,
all comparing y_test to y_pred.
The tests verify each value, confirm rmse >= mae (always true), and confirm r2 is between 0 and 1 for this model.
Check your understanding
What is the key practical advantage of RMSE over MSE for reporting results?
RMSE is always smaller, so it looks better
RMSE ignores outliers entirely
RMSE is in the same units as the target, making it interpretable, while MSE is in squared units that no one can read
RMSE cannot be computed for held-out data
You compare two models. Model A has MAE 5, RMSE 6. Model B has MAE 5, RMSE 15. What does the difference most likely indicate?
Model B is better because RMSE is higher
The models are identical
Model B has a few large outlier errors; its typical error is the same, but big misses inflate its RMSE far above its MAE
Model B has lower variance
A regression model reports R² = 0.85. Which interpretation is correct?
The model is correct on 85% of predictions
Predictions are within 85% of the true values
The model explains about 85% of the variance in the target, relative to a baseline that always predicts the mean
The model has 85% precision
On a held-out test set, a model produces R² = -0.20. What does this mean?
The calculation is invalid; R² cannot be negative
The model is 20% accurate
The model performs worse than simply predicting the mean of the target — a red flag, often from overfitting or a mismatched model
The model explains 20% of the variance
Why should you report an absolute metric (MAE or RMSE) alongside R² rather than R² alone?
Because R² is always wrong
Because R² is scale-free and says nothing about the actual size of the errors; the same R² can correspond to tiny or huge errors depending on the target's spread
Because MAE and R² are identical
Because R² cannot be computed without RMSE
When is MAE generally preferable to RMSE as your error metric?
When you want big mistakes to dominate the score
When the target has no units
When all errors should count in proportion to their size and you do not want a few outliers to dominate the metric
When you are minimizing squared error during training