Linear Regression

The straight line through your data — the simplest, most interpretable way to predict a number, and the foundation every other regression model is measured against.

So far our models have predicted categories: which species, malignant or benign. Now we predict a number: a house price, a patient's disease progression, tomorrow's temperature. That family of tasks is called regression, and the place everyone starts — for good reason — is linear regression.

Linear regression is old, simple, and almost embarrassingly interpretable. It will not win competitions on messy data. But it is the model you reach for first, the baseline every fancier model must beat, and the clearest possible illustration of what "learning from data" even means. Master it and the rest of regression is variations on a theme.

Where this sits in the workflow

Nothing about the four-move workflow changes here. We still load, split, train, and evaluate — we are simply swapping the estimator line for LinearRegression and predicting a number instead of a class. If that shape is not yet automatic, revisit Your First Model, End to End first.

The problem it solves

You have a feature and a number that seems to move with it. Square footage and price. Study hours and exam score. Advertising spend and sales. You want a rule that, given the feature, predicts the number — and ideally a rule you can read, so you can say "each extra square foot is worth about this much."

Linear regression answers with the simplest possible rule: draw a straight line through the cloud of points and read predictions off the line. Give it a new square footage, find that spot on the line, and the height of the line there is your predicted price.

With more than one feature the "line" becomes a flat plane (or, in higher dimensions, a hyperplane you cannot picture), but the idea is unchanged: the prediction is a straight, flat function of the inputs.

The intuition: the best straight line

A line is defined by two numbers: a slope and an intercept. Prediction is then just arithmetic:

predicted y = intercept + slope × feature

The intercept is where the line crosses the vertical axis (the prediction when the feature is zero). The slope is how steeply the line rises — how much the prediction changes for each one-unit increase in the feature.

But which line? Infinitely many lines pass near a cloud of points. Linear regression picks the one that makes the total squared error as small as possible. For each point, the error is the vertical gap between the real value and the line's prediction (the "residual"). We square each gap so that positives and negatives cannot cancel, and so that large misses are penalized hard. The chosen line is the one whose squared gaps add up to the smallest possible total. That criterion has a name — least squares — and it is what LinearRegression solves for you.
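To make the least-squares criterion concrete, here is a minimal sketch in plain Python. The five data points are invented for illustration, and the closed-form slope and intercept formulas are the standard single-feature solution rather than anything taken from this page. It computes the best line and then checks that nudging the line in any direction only raises the total squared error.

```python
# Toy data: five (feature, target) pairs, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def total_squared_error(intercept, slope):
    """Sum of squared vertical gaps (residuals) between the data and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Closed-form least-squares solution for a single feature.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

best = total_squared_error(intercept, slope)

# Try eight nearby lines: each shifts the intercept and/or the slope a little.
nudged = [
    total_squared_error(intercept + di, slope + ds)
    for di in (-0.5, 0.0, 0.5)
    for ds in (-0.2, 0.0, 0.2)
    if (di, ds) != (0.0, 0.0)
]

print(f"slope={slope:.3f}  intercept={intercept:.3f}  total squared error={best:.4f}")
print("every nudged line is worse:", all(e > best for e in nudged))
```

LinearRegression reaches the same answer for you; the hand calculation is only here to show what "smallest possible total" means.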

Why squared, not absolute?

Squaring the residuals does two useful things: it removes the sign (so a miss of +3 and a miss of -3 both count as 9, not cancel to 0), and it punishes big misses far more than small ones (a miss of 4 costs 16, four times as much as a miss of 2). The result is a line that especially dislikes being badly wrong on any single point — which is usually what we want, but note the flip side: it makes the line sensitive to outliers, a weakness we return to below.
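The arithmetic in that paragraph can be checked in a few lines. This is only a sketch using the same four misses mentioned above; the variable names are made up for the example.

```python
misses = [3, -3, 2, 4]  # the residuals used in the paragraph above

raw_total = misses[0] + misses[1]        # +3 and -3 cancel out
absolute = [abs(m) for m in misses]      # sign removed, size unchanged
squared = [m ** 2 for m in misses]       # sign removed, big misses amplified

print("raw +3 and -3 sum to:", raw_total)
print("absolute penalties:", absolute)
print("squared penalties: ", squared)
print("miss of 4 vs miss of 2, absolute:", absolute[3] / absolute[2], "times")
print("miss of 4 vs miss of 2, squared: ", squared[3] / squared[2], "times")
```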

One feature: see the line being fit

Let us make this concrete with a single feature so we can draw the line. We will generate a cloud of points with make_regression, fit a LinearRegression, and plot both the points and the line the model chose.
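A minimal sketch of that fit follows. The make_regression settings (sample count, noise level, seed) are assumptions chosen for illustration, not necessarily the values used for the plot described next, and the plotting step itself is left out so the example needs only scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Illustrative settings (assumed): 100 noisy points with a single feature.
X, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)

model = LinearRegression()
model.fit(X, y)

# Two points are enough to draw a straight line:
# predict at the smallest and largest feature values.
x_ends = [[X.min()], [X.max()]]
y_ends = model.predict(x_ends)

print("line starts at:", (x_ends[0][0], y_ends[0]))
print("line ends at:  ", (x_ends[1][0], y_ends[1]))
```

With matplotlib available, a scatter of X against y plus a line through those two endpoints would reproduce the picture.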

The red line is the single straight line that minimizes the total squared vertical distance to all those blue points. No other line does better by that measure. Notice it does not pass through every point — it cannot, the points are noisy — it threads through their middle.

coef_ and intercept_: where the learning lives

After .fit(), everything the model learned is stored in two attributes: model.coef_ (the slope, or one slope per feature) and model.intercept_ (the constant). That is the entire model — two-ish numbers. You could write the predictions on paper. This radical simplicity is linear regression's superpower.

Reading the coefficients

The coefficients are not just machinery — they are the answer to a business question. Let us pull them out and interpret them in plain language.
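As a sketch of what that looks like in code (the dataset settings are the same illustrative assumptions as before), the snippet below reads the slope and intercept off a fitted model and confirms that predict is nothing more than the line's formula.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Illustrative data (assumed settings), one feature.
X, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=0)
model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # one coefficient per feature; here there is just one
intercept = model.intercept_  # the constant term

print(f"slope:     {slope:.2f}")
print(f"intercept: {intercept:.2f}")

# The model's prediction is exactly intercept + slope * feature.
x_new = 2.0
by_hand = intercept + slope * x_new
by_model = model.predict([[x_new]])[0]
print(f"by hand: {by_hand:.4f}   by model: {by_model:.4f}")
```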

There is the whole interpretation: the slope is the price of one unit of the feature. If the feature were square footage and the target were dollars, a slope of 80 would read "each additional square foot adds about $80 to the predicted price." That single, readable sentence is why linear regression survives in fields — economics, medicine, policy — where being able to explain the model matters as much as its accuracy.

Prediction is a weighted sum

With several features, the model becomes y = intercept + w1*x1 + w2*x2 + w3*x3 + .... Each prediction is just a weighted sum of the features plus a constant, where the weights are the coefficients. "Linear" literally means this: outputs are a straight, additive combination of inputs. No feature is squared, multiplied by another, or bent — unless you add such terms yourself.
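The weighted-sum formula can be verified directly. In this sketch the data is synthetic and noise-free, built from weights we choose ourselves (4.0, 2.0, -1.0, 0.5 are arbitrary), so the fitted model should recover them and the hand-computed sum should match predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: three features, target built from known weights with no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 4.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]

model = LinearRegression().fit(X, y)

# Prediction for the first row, computed as intercept + w1*x1 + w2*x2 + w3*x3.
row = X[0]
weighted_sum = model.intercept_ + np.dot(model.coef_, row)
from_model = model.predict(row.reshape(1, -1))[0]

print("by hand: ", weighted_sum)
print("by model:", from_model)
print("recovered weights:", model.coef_.round(3))
print("recovered intercept:", round(float(model.intercept_), 3))
```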

Many features: the diabetes dataset

Real problems have many features at once. The load_diabetes dataset has ten medical measurements (age, sex, BMI, blood pressure, and six blood serum values) for 442 patients, and the target is a number measuring disease progression one year later. The model now fits a flat plane through ten-dimensional space — impossible to picture, but the math and the code are identical.
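A sketch of the multi-feature fit is below. The 80/20 split and the random_state value are assumptions made for this example, so the printed coefficients may differ from any figures quoted elsewhere in the course.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()
X, y = data.data, data.target
print(X.shape)  # (442, 10): 442 patients, ten measurements each

# Split settings are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# One coefficient per feature, plus a single intercept.
for name, coef in zip(data.feature_names, model.coef_):
    print(f"{name:>4}: {coef:9.2f}")
print(f"intercept: {model.intercept_:.2f}")
```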

Each coefficient says how the prediction moves per unit of that feature, holding the others fixed. A large positive coefficient on BMI means higher BMI pushes the predicted progression up; a negative coefficient means that feature pulls it down. The biggest magnitudes flag the features the model leans on most — though, importantly, magnitude alone can mislead when features are on different scales, which is one reason scaling matters (its own page).

Coefficients are not plug-and-play importances

It is tempting to rank features by coefficient size and call the biggest ones "most important." Be careful: a coefficient's size depends on the units of its feature. A feature measured in meters will get a coefficient 1,000 times larger than the same feature measured in millimeters, with no change in real importance. Compare coefficients fairly only when features are on a common scale. The data-preparation chapter shows how.
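The effect of units on a coefficient is easy to see on synthetic data. In this sketch the feature, its range, and the target formula are all invented; the same lengths are fed to the model once in meters and once in millimeters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
length_m = rng.uniform(1.0, 3.0, size=60)                 # hypothetical feature, in meters
y = 10.0 + 5.0 * length_m + rng.normal(0, 0.1, size=60)   # invented target

model_m = LinearRegression().fit(length_m.reshape(-1, 1), y)
model_mm = LinearRegression().fit((length_m * 1000).reshape(-1, 1), y)

print("coefficient per meter:     ", model_m.coef_[0])
print("coefficient per millimeter:", model_mm.coef_[0])

# The predictions agree: 2 meters and 2000 millimeters are the same length.
p_m = model_m.predict([[2.0]])[0]
p_mm = model_mm.predict([[2000.0]])[0]
print("prediction at 2 m:    ", p_m)
print("prediction at 2000 mm:", p_mm)
```

The two coefficients differ by a factor of 1,000 while the predictions are identical, so the size of a raw coefficient says as much about the unit as about the feature.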

Let us see the model actually predict for specific patients and check its overall quality.
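Here is a minimal sketch of that check. As before, the split settings are assumptions for illustration, and choosing the first five test rows as the "specific patients" is arbitrary.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Split settings are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predictions for the first five held-out patients, next to the true values.
preds = model.predict(X_test[:5])
for predicted, actual in zip(preds, y_test[:5]):
    print(f"predicted {predicted:6.1f}   actual {actual:6.1f}")

# For regressors, .score() returns R² on the data you pass in.
r2 = model.score(X_test, y_test)
print(f"R² on the test set: {r2:.3f}")
```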

The predictions are in the right ballpark but clearly imperfect: disease progression depends on much more than ten measurements. Note that .score() returns R² (R-squared), not accuracy, because the target is a number.

Metrics get their own page

You will notice .score() gives R² here, and you may have heard of MAE (mean absolute error) and MSE (mean squared error) too. Each answers "how wrong is the model?" in a different way and carries its own gotchas. We are deliberately not unpacking them here: the full treatment, including when R² lies and which error metric to trust, lives on the Regression Metrics page. For now, just know that higher R² is better (1.0 is perfect, 0 means no better than always guessing the mean, and a negative value means worse than that) and that one number never tells the whole story.

The assumptions: when the line is the right tool

Linear regression is not magic; it is a bet that the world is roughly linear and additive. When that bet is reasonable, the model is excellent. When it is wrong, the model fails in predictable ways. Knowing the assumptions is knowing when to reach for the tool.

Roughly linear relationship. The target should rise or fall in an approximately straight-line way with each feature. If price quadruples when size doubles, a straight line will systematically miss.
Additive effects. The total prediction is a sum of separate per-feature contributions. The model assumes features do not multiply or interact unless you explicitly add such terms.
No extreme outliers dominating. Because errors are squared, a single wild point can yank the whole line toward itself.
Features not redundant with each other. Two features carrying nearly the same information (multicollinearity) make the individual coefficients unstable and hard to interpret, even if predictions are okay.
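The outlier assumption is the easiest of these to demonstrate. In the sketch below, ten invented points sit exactly on a line with slope 2; moving just the last one far upward is enough to drag the fitted slope well away from 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0                 # ten points exactly on a line with slope 2
X = x.reshape(-1, 1)

clean_slope = LinearRegression().fit(X, y).coef_[0]

y_outlier = y.copy()
y_outlier[-1] += 100.0            # one wild point at the right-hand end
outlier_slope = LinearRegression().fit(X, y_outlier).coef_[0]

print(f"slope without the outlier: {clean_slope:.2f}")
print(f"slope with one outlier:    {outlier_slope:.2f}")  # roughly 7.45
```

Nine of the ten points still say "slope 2", yet the single squared miss outweighs them.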

When to reach for linear regression

Use it when you want a fast, interpretable baseline, when the relationship looks roughly straight on a scatter plot, when you need to explain the effect of each feature in plain units, or when you have few samples and a flexible model would just overfit. It is almost always the right first model — beat it before you complicate things.

When NOT to use it

The same simplicity that makes linear regression clear also makes it the wrong choice for genuinely curvy or interaction-heavy problems.

Clearly nonlinear data. If the scatter plot bends — a U-shape, a saturating curve, a cycle — a straight line cannot capture it and will be biased everywhere. Either engineer features (add an x² term, take a log) or switch to a flexible model like a decision tree or a gradient-boosted ensemble.
Strong feature interactions. When the effect of one feature depends on another (a discount matters more on expensive items), the additive assumption breaks. Tree-based models handle interactions natively.
Many redundant features. Heavy multicollinearity makes coefficients wild and untrustworthy. Regularized cousins — Ridge and Lasso — tame this by penalizing large coefficients, and they are the usual next step.
Heavy outliers you cannot clean. Squared error makes a few extreme points dominate the fit. Robust regressors exist for this.
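As a rough illustration of the multicollinearity point, the sketch below builds two nearly identical features (synthetic data; the noise scales and alpha=1.0 are arbitrary choices) and compares plain least squares with Ridge. The plain coefficients tend to swing apart to large offsetting values, while Ridge keeps them small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)     # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)  # target really depends on one signal

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("plain least squares:", ols.coef_)
print("ridge (alpha=1.0):  ", ridge.coef_)
print("coefficient norm, plain:", np.linalg.norm(ols.coef_))
print("coefficient norm, ridge:", np.linalg.norm(ridge.coef_))
```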

The fix is rarely "abandon regression" — it is "use a regression that fits the shape of your data." Linear regression is the floor, not the ceiling.

Here is the failure mode made visible. We fit a straight line to data that is actually curved, and you can see the line systematically miss.
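A minimal sketch of that experiment, without the plot: the U-shaped data is synthetic (a parabola plus noise, with assumed range and noise level), and the residuals are averaged over the middle and the two ends to show where the line misses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x ** 2 + rng.normal(0, 0.5, size=100)   # U-shaped, invented data
X = x.reshape(-1, 1)

line = LinearRegression().fit(X, y)
r2 = line.score(X, y)
print(f"R² of the straight line: {r2:.3f}")

# Residual = actual - predicted. Negative means the line sits too high.
residuals = y - line.predict(X)
middle = residuals[40:60].mean()
ends = np.concatenate([residuals[:10], residuals[-10:]]).mean()
print(f"mean residual in the middle: {middle:.2f}")
print(f"mean residual at the ends:   {ends:.2f}")
```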

The line is the best possible straight line, and it is still badly wrong — too high in the middle, too low at the ends. No amount of fitting fixes this, because the assumption itself (straightness) is violated. That low R² is the model honestly telling you "I am the wrong shape for this data."
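One of the remedies mentioned earlier, adding an x² term, can be sketched on the same kind of synthetic U-shaped data. The estimator is still LinearRegression; only the feature matrix changes, which is why this counts as feature engineering and not a different model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = x ** 2 + rng.normal(0, 0.5, size=100)   # U-shaped, invented data

X_line = x.reshape(-1, 1)                   # just x
X_curve = np.column_stack([x, x ** 2])      # x plus a hand-made squared column

r2_line = LinearRegression().fit(X_line, y).score(X_line, y)
r2_curve = LinearRegression().fit(X_curve, y).score(X_curve, y)

print(f"R² with x only:   {r2_line:.3f}")
print(f"R² with x and x²: {r2_curve:.3f}")
```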

A common misconception: low R² means broken code

A disappointing R² usually does not mean you made a mistake — it often means linear regression is the wrong model for this data, or the features genuinely do not explain the target. The fix is a better model or better features, not more fiddling with LinearRegression. Reading the residuals (where the model is wrong) tells you which.

Common misconceptions

"Linear regression draws a curve through the points." It draws a straight line (or flat plane). Any apparent curve comes only from features you transformed first (like adding a squared term). The model itself is strictly linear in its inputs.
"The line passes through the data points." It almost never passes through them; it threads through their middle, minimizing total squared distance. Residuals (gaps) are expected and normal.
"A bigger coefficient means a more important feature." Only if features share a scale. Coefficient size depends on the feature's units, so it is not a clean importance ranking on raw data.
"Linear regression needs the target to be normally distributed." The basic least-squares fit does not require that. Some inference (p-values, confidence intervals) leans on assumptions about residuals, but the prediction machinery itself just minimizes squared error.
"It is too simple to be useful." It is the workhorse of econometrics, epidemiology, and forecasting precisely because it is simple and interpretable. Simple and useful are not opposites.

Real-world applications

Linear regression is everywhere a number must be predicted and explained:

Economics and finance. Estimating how income, interest rates, or prices respond to inputs — where the coefficient is the headline result ("a 1% rate rise reduces demand by X").
Medicine and epidemiology. Relating risk factors to outcomes in a way regulators and clinicians can scrutinize line by line.
Real estate and pricing. A transparent first estimate of value from size, location, and age — easy to audit and defend.
Forecasting baselines. Before anyone deploys a complex model, a linear baseline sets the bar that complexity has to clear to be worth it.

In every case the appeal is the same pair of virtues: it is fast and it is honest about what it learned — you can read the rule straight off the coefficients.

Your turn

Build a one-feature linear regression and pull out what it learned.

Generate data with make_regression(n_samples=100, n_features=1, noise=12, random_state=1) into X and y.
Create a LinearRegression called model and .fit() it on X and y.
Store the fitted slope (the single coefficient) in slope and the intercept in intercept. (The slope is model.coef_[0].)
Use the model to predict the target at x = 1.5 and store the number in pred_at_1_5. Remember predict needs a 2D input: [[1.5]].

The hidden tests check that model is fitted, that slope is the expected strong positive value, and that pred_at_1_5 matches the line's own formula intercept + slope * 1.5.

Check your understanding

Question (select one)

What criterion does ordinary linear regression use to choose its line?

It minimizes the total squared vertical distance from the points to the line (least squares)

It connects the first and last data points

It passes through as many points as possible

It maximizes the correlation between two features

Question (select one)

A linear regression on house data has a coefficient of 80 on square_feet (target in dollars). What does that 80 mean?

The model is 80% accurate

80 houses were used to train the model

Each additional square foot adds about $80 to the predicted price, holding the other features fixed

The intercept of the line is 80

Question (select one)

You fit LinearRegression to data whose scatter plot is clearly U-shaped, and R² is very low. What is the most likely situation?

Your code has a bug; linear regression always fits well

The test set is too small to compute R²

A straight line cannot capture a curved relationship, so the model is the wrong shape — you need transformed features or a more flexible model

R² cannot be computed for nonlinear data

Question (select one)

With multiple features, how does linear regression form a single prediction?

It picks the one most important feature and ignores the rest

It computes a weighted sum of the features plus the intercept: intercept + w1*x1 + w2*x2 + ...

It averages the values of all the features

It multiplies all the features together

Question (select one)

Why can squared error make linear regression sensitive to outliers?

Outliers are always removed before fitting

Squaring makes all errors equal to one

Linear regression ignores points that are far from the line

Squaring a large residual produces a very large penalty, so a single far-off point can pull the whole line toward itself
