Regression and Trend Lines
lmplot and regplot — fitting a line or curve through a scatter to summarize a relationship, without fooling yourself.
A scatter plot shows you every point. Often you want one more thing: a single line drawn through the cloud that says "on average, as x goes up, y does this." That line is a regression fit, and Seaborn can add it on top of the scatter for you.
A fitted line is a powerful summary — and a powerful way to mislead yourself. It can imply a relationship is simpler, stronger, or more trustworthy than the data really supports. So this page teaches the tools and the discipline: how to draw a trend line, and how to keep it honest.
What a regression line is (and isn't)
Seaborn has two functions that draw a scatter plus a fitted line:
- sns.regplot: axes-level. Draws onto a single Axes and returns it.
- sns.lmplot: figure-level. Makes its own figure and, crucially, supports hue, col, and row to fit lines per group or per panel.
Both draw the same two things: the points, and a line that estimates the central tendency of y for each value of x — your best single guess for y given x. Around that line is a translucent band.
What the shaded band means
The band is a 95% confidence interval for the fitted line itself — a measure of how well-pinned-down the line is, given this sample. It is not a range that contains 95% of the data points, and it is not a prediction interval for a new observation. A narrow band means "we're fairly sure where the average sits," not "most points fall in here."
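To make the distinction concrete, here is a minimal NumPy sketch that builds a band in the same spirit (refit the line on bootstrap resamples, keep the middle 95% of the refitted lines) and then counts how many raw points land inside it. The data are hypothetical and every variable name is an assumption for illustration, not part of Seaborn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a true straight-line trend plus noise.
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=200)

grid = np.linspace(x.min(), x.max(), 50)

# Refit the line on many resamples of the rows (a bootstrap).
fits = np.empty((1000, grid.size))
for i in range(1000):
    idx = rng.integers(0, x.size, size=x.size)
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    fits[i] = intercept + slope * grid

# 95% band for the line: the middle 95% of refitted lines at each grid x.
lower, upper = np.percentile(fits, [2.5, 97.5], axis=0)

# How many raw points actually fall inside that band?
inside = (y >= np.interp(x, grid, lower)) & (y <= np.interp(x, grid, upper))
share_inside = inside.mean()
print(f"share of points inside the band: {share_inside:.2f}")
```

The printed share should come out far below 95%, which is the point: the band describes uncertainty about the line, not the spread of the data.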
The minimum
Map two numeric columns and let lmplot fit a straight line:
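A minimal sketch of that call follows. So that it runs offline, it builds a small synthetic stand-in with the assumed column names total_bill and tip; in the lesson you would load the real data instead (for example with sns.load_dataset("tips")).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (hypothetical values, assumed column names).
rng = np.random.default_rng(0)
total_bill = rng.gamma(shape=4.0, scale=5.0, size=200) + 3
tips = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 1.0 + 0.1 * total_bill + rng.normal(0, 1.0, size=200),
})

# One scatter, one straight-line fit, one confidence band.
g = sns.lmplot(data=tips, x="total_bill", y="tip")
```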
The line slopes up: larger bills tend to come with larger tips. The band is narrow in the middle — where most bills sit, so the average is well estimated — and flares at the extremes, where there are fewer points to pin it down. Notice the points are still there: the line is a summary laid over the evidence, never a replacement for it.
A line per group with hue
Because lmplot is figure-level, mapping a categorical column to hue
fits a separate line for each group and colors them. This answers a
sharper question: does the relationship differ between groups?
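A sketch of the per-group fit, again on a synthetic stand-in for tips (the smoker effect below is invented purely so the two lines differ). The small slopes dictionary is an assumed helper that puts a number on what the picture shows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 200
total_bill = rng.gamma(4.0, 5.0, size=n) + 3
smoker = rng.choice(["Yes", "No"], size=n)
# Hypothetical effect: a shallower slope for smokers, so the lines fan apart.
slope = np.where(smoker == "Yes", 0.07, 0.13)
tips = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 1.0 + slope * total_bill + rng.normal(0, 1.0, size=n),
    "smoker": smoker,
})

# hue fits and colors one line per smoker group on the same axes.
g = sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker")

# Numeric companion to the picture: one least-squares slope per group.
slopes = {
    name: np.polyfit(grp["total_bill"], grp["tip"], deg=1)[0]
    for name, grp in tips.groupby("smoker")
}
print(slopes)
```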
Now you can compare slopes directly. If two lines are nearly parallel, the
groups respond to total_bill similarly; if they fan apart, the
relationship itself depends on the group. (Want them side by side instead
of overlaid? Use col="smoker" for one panel each — the same faceting idea
from the relational plots.)
The key lesson: a straight line on a curve lies
A straight fit assumes the relationship is a straight line. When it
isn't, the line is not just imprecise — it is wrong, and it will lie to
you confidently. The mpg dataset shows this beautifully: fuel efficiency
versus engine horsepower.
Look closely at how the line sits against the points. The relationship bends: mpg falls steeply at low horsepower and then flattens. The straight line splits the difference, so it runs below the data at both ends and above it in the middle. Read off a prediction at either extreme or in the mid-range and it is systematically off. The line looks precise and is simply incorrect.
There are two honest fixes.
Fit a curve with order. Setting order=2 fits a degree-2 polynomial
(a parabola), which can follow a single bend:
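The gain from the extra term can also be checked numerically. The sketch below uses a hypothetical curved stand-in for the horsepower/mpg data and compares the residual sum of squares of a least-squares line against a least-squares parabola; the rss helper is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for horsepower vs mpg: steep drop, then flattening.
horsepower = rng.uniform(50, 230, size=300)
mpg = 2000 / horsepower + rng.normal(0, 1.5, size=300)

def rss(degree):
    """Residual sum of squares for a least-squares polynomial of this degree."""
    coefs = np.polyfit(horsepower, mpg, deg=degree)
    return float(np.sum((mpg - np.polyval(coefs, horsepower)) ** 2))

rss_line, rss_parabola = rss(1), rss(2)
print(f"straight line RSS: {rss_line:.0f}, parabola RSS: {rss_parabola:.0f}")

# The plotting call this section describes, on the real data, would be:
# sns.lmplot(data=mpg, x="horsepower", y="mpg", order=2)
```

A lower RSS is guaranteed when you add a term, so treat the size of the drop, together with the picture, as the evidence rather than the mere fact of a drop.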
Or let the data choose its own shape with lowess=True. LOWESS is a
locally weighted smoother: instead of assuming any global formula, it
fits the trend in small neighborhoods and stitches them together. It is the
best "just show me the shape, don't impose one" option. Two practical notes:
Seaborn relies on the statsmodels package for this fit, and it draws no
confidence band around a LOWESS curve.
sns.lmplot(data=mpg, x="horsepower", y="mpg", lowess=True)
The polynomial curve hugs the points far better than the straight line did,
and a LOWESS fit would bend with them even more flexibly. The lesson
generalizes: always eyeball the scatter first. If it bends, a straight
lmplot is the wrong summary, and you reach for order= or lowess=True.
You draw sns.lmplot(data=mpg, x="horsepower", y="mpg") and the straight
line clearly sits below the points at both ends and above them in the
middle. What is the right conclusion?
Horsepower and mpg are unrelated, so the line is flat noise.
The relationship is curved, so a straight-line fit is the wrong model; use order=2 or lowess=True.
The confidence band is too narrow and should be widened.
You should remove the points and keep only the line for clarity.
Two more fitting options, briefly
robust=True down-weights outliers so a few extreme points can't drag
the whole line toward themselves. Use it when a handful of stray points are
distorting an otherwise clear linear trend (it is slower, since it
re-weights iteratively).
sns.lmplot(data=tips, x="total_bill", y="tip", robust=True)
logistic=True is for a binary outcome: a y that is only 0 or 1.
Instead of a straight line, it fits an S-shaped logistic curve estimating
the probability that y = 1 as x changes. The titanic dataset has a 0/1
survived column:
# survived is 0/1; a logistic fit estimates P(survived) vs age.
sns.lmplot(data=titanic, x="age", y="survived", logistic=True)
A straight line here would happily predict probabilities above 1 or below 0, which is nonsense; the logistic curve stays between 0 and 1 by design.
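A small NumPy sketch of that failure, on hypothetical 0/1 data (not the titanic dataset) where the outcome depends strongly on x. It fits a straight least-squares line and evaluates it at the ends of the observed range, next to a logistic curve with assumed parameters.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 0/1 outcome whose probability rises with x.
x = rng.uniform(0, 80, size=400)
p_true = 1 / (1 + np.exp(-(x - 40) / 6))
y = (rng.uniform(size=400) < p_true).astype(float)

# A straight least-squares line through the 0/1 values.
slope, intercept = np.polyfit(x, y, deg=1)
ends = np.array([0.0, 80.0])
line_at_ends = intercept + slope * ends

# A logistic curve 1 / (1 + exp(-(a + b*x))) is always strictly between 0 and 1.
# Here a and b are simply the values used to generate the data, not a fitted model.
logistic_at_ends = 1 / (1 + np.exp(-(ends - 40) / 6))

print("line:", line_at_ends, "logistic:", logistic_at_ends)
```

With a weaker relationship the line may stay inside 0 to 1 over the observed range; the logistic form is what guarantees it everywhere.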
Diagnosing a fit with residplot
How do you check whether a straight line was appropriate, rather than
just eyeballing it? Plot the residuals — the leftover gap between each
point and the line — against x. sns.residplot does this. The rule is
simple:
- Patternless cloud around zero → the straight-line model captured the trend; what's left is just noise.
- A visible curve in the residuals → the model missed a bend, so the
straight fit is wrong (exactly the horsepower/mpg situation).
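A sketch of the check, using a hypothetical curved stand-in for the horsepower/mpg data so it runs offline. sns.residplot draws the residual scatter; the NumPy lines recompute the same straight-line residuals by hand and compare their average at the ends of the x-range with the average in the middle (the quantile cut-offs are arbitrary choices for illustration).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)
hp = rng.uniform(50, 230, size=300)
cars = pd.DataFrame({"horsepower": hp, "mpg": 2000 / hp + rng.normal(0, 1.5, size=300)})

# Residuals from a straight-line fit, plotted against x.
ax = sns.residplot(data=cars, x="horsepower", y="mpg")

# The same residuals by hand: observed y minus the fitted line.
slope, intercept = np.polyfit(hp, cars["mpg"].to_numpy(), deg=1)
resid = cars["mpg"].to_numpy() - (intercept + slope * hp)

q10, q40, q60, q90 = np.quantile(hp, [0.1, 0.4, 0.6, 0.9])
ends = resid[(hp < q10) | (hp > q90)].mean()
middle = resid[(hp > q40) & (hp < q60)].mean()
print(f"mean residual at the ends: {ends:.2f}, in the middle: {middle:.2f}")
```

Positive residuals at both ends with negative ones in between is the U-shape to look for; a well-specified straight fit would leave both averages near zero.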
The residuals smile (a U-shape) rather than scattering flatly around zero — a clear signal that the relationship bent and a straight line could not follow it.
The cautions that matter most
A trend line is the easiest plot to over-trust. Three rules keep you honest.
Association is not causation
A sloping line means x and y move together in this data. It says nothing about whether changing x would cause y to change. Larger bills come with larger tips — but that does not prove that inflating a bill causes a bigger tip; party size, service, and occasion all lurk behind both. A regression line describes an association; only careful study design earns the word "causes."
Never extrapolate past the data
The fit is only evidence within the range of x you actually observed.
Extending the line to horsepower values you have no data for is guesswork
dressed up as a measurement — and if the true relationship curves (as
mpg does), the extrapolation is not just uncertain, it is wrong. Trust
the line only over the x-range the points cover.
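To see how badly this can go, the sketch below fits a straight line to a hypothetical curved relationship observed only between 50 and 230 horsepower, then extends it to 400. The generating curve and the value 400 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical curved relationship, observed only for horsepower in [50, 230].
horsepower = rng.uniform(50, 230, size=300)
mpg = 2000 / horsepower + rng.normal(0, 1.5, size=300)

slope, intercept = np.polyfit(horsepower, mpg, deg=1)

# Extend the straight line far outside the observed range.
hp_new = 400.0
line_says = intercept + slope * hp_new
curve_says = 2000 / hp_new  # what the assumed true relationship gives

print(f"line predicts {line_says:.1f} mpg; the assumed true curve gives {curve_says:.1f}")
```

The line ends up predicting a negative fuel efficiency, an impossible value that nothing in the fitted plot would have warned you about.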
Always show the points
The band is not the data, and the line is not the data — the points are. Showing them is what let you catch the curved-fit problem above. A lonely line with a shaded band hides whether the fit is any good. Keep the scatter visible so the summary stays accountable to the evidence.
Your turn
Using the tips dataset, draw a regression plot with sns.lmplot:
- total_bill on the x-axis,
- tip on the y-axis,
- a separate fitted line per smoker group (use hue).
Assign the result to a variable named g.
Check your understanding
What does the line drawn by regplot / lmplot represent?
The path connecting every data point in order.
An estimate of the average value of y for each value of x.
The boundary that separates outliers from normal points.
The 95% range that contains most of the data.
The translucent band around an lmplot line is best described as:
A region guaranteed to contain 95% of the data points.
The range of a future single prediction.
A 95% confidence interval for the fitted line itself.
The standard deviation of x.
Your scatter of horsepower vs mpg is clearly curved. Which approach
gives an honest trend?
Keep the straight fit but widen the confidence band.
Drop the points so only the straight line shows.
Use order=2 for a polynomial fit, or lowess=True to follow the local trend.
Swap to regplot; it automatically detects curvature.
A regression of ice-cream sales on drowning incidents shows a strong upward line. What can you correctly conclude?
Buying ice cream causes drownings.
Drownings cause people to buy ice cream.
The two variables are associated in this data; a lurking factor (like hot weather) may drive both.
Nothing — a positive slope carries no information at all.
You can now lay an honest trend over a scatter — and spot when that trend is lying. Next we widen the lens from one pair of variables to all of them at once, with the correlation heatmap.