Logistic Regression
Despite its name, this is a classification algorithm — a linear score squashed into a probability, then a threshold. The interpretable workhorse for predicting yes-or-no.
Let us clear up the most confusing name in machine learning before we do anything else. Logistic regression is a classification algorithm. It predicts categories — spam or not, malignant or benign, will-churn or will-stay — not numbers. The word "regression" in its name is a historical accident about the math under the hood, and it trips up nearly every newcomer. Whenever you see "logistic regression," think classifier.
With that settled, logistic regression is to classification what linear regression is to predicting numbers: the simple, fast, interpretable baseline you reach for first and that everything fancier has to beat. It is among the most widely deployed classifiers in practice.
The one misconception to kill on sight
Logistic regression does NOT predict a continuous number, and it is not a flavor of linear regression for regression tasks. It predicts a class. The "regression" in the name refers to the linear score it computes internally, which it then squashes into a probability and converts to a class. If you take away one thing from this page, take this: logistic regression is a classification algorithm.
Where this sits
The four-move workflow is unchanged — load, split, train, evaluate — and
this is still a .fit() / .predict() estimator like the ones in Your
First Model, End to End. What is new is how it makes a decision and the
extra .predict_proba() method for probabilities. If linear regression's
coefficients are fresh in your mind, you are well set up: logistic
regression reuses the same weighted-sum idea.
The problem it solves
You want a yes-or-no answer, and ideally a confidence with it. Is this transaction fraudulent? Will this patient's tumor turn out malignant? Will this user click? A bare label ("fraud") is useful, but a probability ("87% chance of fraud") is far more useful — it lets you rank cases, set thresholds, and decide how much to trust each call.
Linear regression cannot do this directly. Its straight line runs off to plus and minus infinity, so it happily predicts -0.3 or 1.8 — nonsense as a probability, which must live between 0 and 1. Logistic regression is the fix: it keeps linear regression's interpretable weighted-sum core but bends the output into the 0-to-1 range so it reads as a genuine probability.
The intuition: linear score, then a squash
Logistic regression works in three steps. Hold this picture in your head and the whole algorithm follows.
Step 1 — a linear score. Exactly like linear regression, the model
computes a weighted sum of the features plus a constant:
z = intercept + w1*x1 + w2*x2 + .... This z can be any number, from
hugely negative to hugely positive. A large positive z leans toward
"yes"; a large negative z leans toward "no."
Step 2 — squash with the sigmoid. We feed z through the sigmoid
function, p = 1 / (1 + e^-z). The sigmoid is an S-shaped curve that takes
any number and gently maps it into the open interval (0, 1). Very negative
z maps near 0, very positive z near 1, and z = 0 maps to exactly 0.5.
Now the output is a legitimate probability.
Step 3 — threshold into a class. A probability is not yet a decision. We
apply a threshold, by default 0.5: if p >= 0.5 predict class 1,
otherwise class 0. That is the difference between .predict_proba() (gives
you p) and .predict() (gives you the class after thresholding).
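The three steps can be sketched by hand in a few lines of NumPy. This is only an illustration: the intercept, the weights, and the three example rows are made-up values, not parameters learned from any dataset.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters, chosen only for illustration (not fitted).
intercept = -1.0
weights = np.array([0.8, -0.5])

# Three made-up examples with two features each.
X = np.array([
    [3.0, 1.0],
    [0.5, 2.0],
    [2.0, 1.0],
])

z = intercept + X @ weights               # Step 1: linear score
p = sigmoid(z)                            # Step 2: squash into a probability
predicted_class = (p >= 0.5).astype(int)  # Step 3: threshold at 0.5

for score, prob, label in zip(z, p, predicted_class):
    print(f"z = {score:+.2f}  p = {prob:.3f}  class = {label}")
```

The second row has a negative score, so its probability falls below 0.5 and it is assigned class 0; the other two rows have positive scores and are assigned class 1.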
Meet the sigmoid
The sigmoid is the heart of the model: an S-curve that turns an unbounded score into a 0-to-1 probability. It is steep in the middle (small changes in the score move the probability a lot near 0.5) and flat at the extremes (once the model is very confident, more evidence barely moves the probability). That shape is exactly what we want from a confidence: decisive in the ambiguous middle, saturating when the answer is clear.
Let us actually look at the sigmoid so the S-shape is concrete.
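A minimal sketch that evaluates the sigmoid at a handful of scores, as a numeric stand-in for the picture. The grid of scores is an arbitrary choice made for this illustration.

```python
import numpy as np


def sigmoid(z):
    """The logistic function p = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))


# Arbitrary scores, symmetric around zero.
scores = np.array([-6.0, -2.0, -0.5, 0.0, 0.5, 2.0, 6.0])
probs = sigmoid(scores)

for z, p in zip(scores, probs):
    print(f"z = {z:+5.1f}  ->  p = {p:.4f}")
```

Near the middle, moving the score from 0 to 0.5 shifts the probability by roughly 0.12. In the tail, it takes a move from 2 all the way to 6 to shift it by about the same amount, which is the saturation described above.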
See how z = 0 lands exactly on p = 0.5 (the threshold line), and how the
curve flattens toward 0 and 1 at the edges. The point where z = 0 — and
therefore p = 0.5 — is the decision boundary: on one side the model
predicts class 1, on the other class 0.
The decision boundary is a straight line
Because the score z is a linear combination of the features, the place
where z = 0 (the boundary between the two classes) is a straight line in
2D, a flat plane in 3D, a hyperplane in general. This is the defining trait
of logistic regression: it carves the feature space with a single
straight cut. That makes it simple and interpretable, and it is also its
main limitation, as we will see.
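To see why the boundary is straight, one can solve z = 0 for the second feature in a two-feature model. The sketch below uses hypothetical parameters (intercept, w1, w2 are made up) and checks that the resulting points all have a score of zero and lie on a line with constant slope.

```python
import numpy as np

# Hypothetical parameters for a two-feature model (not fitted to data).
intercept = -3.0
w1, w2 = 1.0, 2.0


def boundary_x2(x1):
    """x2 on the decision boundary z = 0 for a given x1 (needs w2 != 0)."""
    return -(intercept + w1 * x1) / w2


x1_values = np.array([-1.0, 0.0, 1.0, 3.0])
x2_values = boundary_x2(x1_values)

# Every point on the boundary has a linear score of zero, i.e. p = 0.5.
z_on_boundary = intercept + w1 * x1_values + w2 * x2_values

# The slope between consecutive boundary points is always -w1 / w2.
slopes = np.diff(x2_values) / np.diff(x1_values)
print(x2_values, z_on_boundary, slopes)
```

With more than two features the same algebra applies; the set of points where the weighted sum equals zero is a hyperplane.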
Logistic regression in practice: breast cancer
Time to train one. The load_breast_cancer dataset has 569 tumors, each
described by 30 measurements (radius, texture, smoothness, and so on). The
target is binary: 0 = malignant, 1 = benign. We will predict which from
the measurements.
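A minimal sketch of the four-move workflow on this dataset. The split settings (20% test, random_state=42, stratified on the target) are assumptions borrowed from the exercise at the end of this page; the exact accuracy you see depends on the split and on your scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load: 569 tumors, 30 measurements each; target 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Split: hold out 20% for testing, keeping the class balance (assumed settings).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train: max_iter=1000 gives the optimizer a larger iteration budget.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate: fraction of test tumors classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```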
Around 94% accuracy from a model that is, at heart, a single weighted sum passed through an S-curve. That is the appeal: cheap, fast, and remarkably strong on data where a straight boundary roughly separates the classes.
Why max_iter=1000?
Logistic regression finds its coefficients by iterative optimization. On
some datasets the default iteration budget is too small and you get a
"failed to converge" warning. Passing max_iter=1000 simply gives the
optimizer more steps. It does not change the model — it just lets the fit
finish settling. (Scaling the features, covered in the data-prep chapter,
also helps it converge faster.)
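One way to check the iteration budget is the fitted n_iter_ attribute, which records how many optimizer iterations a fit used. The sketch below compares raw and standardized features; the split settings are assumptions, and the iteration counts will vary with solver and library version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same estimator, fitted once on raw features and once on standardized ones.
raw_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scaled_pipeline = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000)
).fit(X_train, y_train)

# n_iter_ holds the number of iterations the optimizer actually ran.
raw_iters = int(raw_model.n_iter_[0])
scaled_iters = int(
    scaled_pipeline.named_steps["logisticregression"].n_iter_[0]
)

print(f"iterations on raw features:    {raw_iters}")
print(f"iterations on scaled features: {scaled_iters}")
```

If the raw count sits at the max_iter ceiling, the fit ran out of budget; the scaled fit would be expected to need far fewer steps.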
.predict() versus .predict_proba()
This is the part worth slowing down for, because the probability is often
the whole reason to use logistic regression. .predict() gives you the
class. .predict_proba() gives you the probability behind that class — the
model's confidence.
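A small sketch that prints both outputs side by side for the first five test tumors. The split settings are assumed (20% test, random_state=42, stratified), so the particular rows and probabilities will differ under another split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probabilities (one column per class) and hard labels for five tumors.
proba = model.predict_proba(X_test[:5])
labels = model.predict(X_test[:5])

print("column order (model.classes_):", model.classes_)
for row, label in zip(proba, labels):
    print(
        f"P(malignant) = {row[0]:.3f}   "
        f"P(benign) = {row[1]:.3f}   ->  predict: {label}"
    )
```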
A few things to read off that table. predict_proba returns one column
per class, in the order shown by model.classes_, and each row sums to
1 (the probabilities of all classes must add up). The predicted class is
simply the column with the higher probability — that is, applying the 0.5
threshold to the class-1 probability.
predict is just predict_proba plus a threshold
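For binary logistic regression, model.predict(X) is, in effect,
model.predict_proba(X)[:, 1] >= 0.5 mapped back to the class labels. The class is the probability after
thresholding. This means you can change the cutoff yourself: if missing a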
thresholding. This means you can change the cutoff yourself: if missing a
malignant tumor is far worse than a false alarm, you might predict
"malignant" whenever its probability exceeds, say, 0.3 instead of 0.5 —
trading more false alarms for fewer misses. Choosing that threshold well —
using precision, recall, and ROC curves — is a core skill the course's
classification-evaluation material covers in depth.
Let us make the threshold idea tangible by moving it ourselves.
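A minimal sketch of applying several cutoffs to the same probabilities. The thresholds (0.3, 0.5, 0.7, 0.9) and the split settings are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of class 1 (benign) for every test tumor; computed once.
p_benign = model.predict_proba(X_test)[:, 1]

# Apply different cutoffs to the same probabilities.
thresholds = (0.3, 0.5, 0.7, 0.9)
counts = []
for threshold in thresholds:
    predicted_benign = p_benign >= threshold
    counts.append(int(predicted_benign.sum()))
    print(f"threshold {threshold:.1f}: {counts[-1]} of {len(p_benign)} called benign")
```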
Raise the threshold and the model becomes more reluctant to say "benign"; lower it and it says "benign" more freely. Same model, same probabilities — only the cutoff moved. The probability is the model's real output; the threshold is a decision you control.
Coefficients: log-odds, kept intuitive
Like linear regression, a fitted logistic regression stores its learning in
coef_ and intercept_. The interpretation is slightly less direct,
because the coefficients act on the score z, not on the probability
itself. Keep it intuitive:
- A positive coefficient means: as that feature increases, the score z rises, which pushes the probability of class 1 up.
- A negative coefficient pushes the probability of class 1 down.
- A coefficient near zero means that feature barely moves the decision.
The precise statement is that each coefficient is the change in the log-odds of class 1 per unit of the feature, but you rarely need that phrasing day to day. The sign tells you the direction of the push and the magnitude tells you the strength. Just as with linear regression, magnitudes are only comparable when the features share a scale.
Coefficient size depends on feature scale
The same caution from linear regression applies: a coefficient's magnitude is tied to its feature's units, so do not read raw coefficient sizes as a clean importance ranking. Logistic regression also genuinely benefits from scaled features — it converges faster and the coefficients become comparable. The data-preparation chapter shows how to scale inside a pipeline so it stays leak-free.
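As a rough sketch of reading direction and strength off a fitted model, the code below standardizes the features first so that magnitudes are at least comparable, then lists the five largest coefficients by absolute value. Using StandardScaler in a pipeline here is an assumption made for the illustration, as are the split settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Scale, then fit; the scaler is learned on the training data only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
clf = pipeline.named_steps["logisticregression"]

coefs = clf.coef_[0]                      # one weight per feature
order = np.argsort(np.abs(coefs))[::-1]   # strongest push first

print(f"intercept: {clf.intercept_[0]:+.2f}")
for i in order[:5]:
    direction = "benign (class 1)" if coefs[i] > 0 else "malignant (class 0)"
    print(f"{data.feature_names[i]:<25} {coefs[i]:+.2f}  pushes toward {direction}")
```

These weights act on the score z, not on the probability directly, so treat the ranking as a guide to direction and relative strength rather than a definitive importance measure.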
Multiclass: it is not only for two classes
Logistic regression extends naturally to more than two classes. scikit-learn handles this for you — on iris (three species) it simply learns a score per class and picks the highest. The code is identical; only the number of probability columns changes.
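A minimal sketch on iris, assuming the same kind of stratified 80/20 split used elsewhere on this page. Only the dataset changes; the estimator calls are the same.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Three species, four measurements per flower.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One probability column per class, in the order given by model.classes_.
proba = model.predict_proba(X_test[:5])
labels = model.predict(X_test[:5])

print("classes:", model.classes_)
for row, label in zip(proba, labels):
    formatted = "  ".join(f"{p:.3f}" for p in row)
    print(f"{formatted}   ->  predict: {label}")
```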
Each row still sums to 1, now across three classes, and the predicted species is the column with the largest probability. The mental model is the same: linear scores, squashed and normalized into probabilities, then the biggest one wins.
When to use it — and when not to
When to reach for logistic regression
Use it when you want a fast, interpretable classification baseline, when you need calibrated-ish probabilities rather than just labels, when the classes are roughly separable by a straight boundary, or when you must explain which features drive the decision and in which direction. As with linear regression, it is almost always the right first classifier — beat it before reaching for anything heavier.
The flip side is that its straight-line boundary is also its ceiling.
- Nonlinear boundaries. If the classes are separated by a curve or a ring rather than a straight line, a single linear cut cannot split them. The classic illustration is two interleaving moons or concentric circles — logistic regression draws one straight line through them and fails. You either engineer nonlinear features or switch to a model that bends, like a decision tree, a random forest, or an SVM with a nonlinear kernel.
- Strong feature interactions. Like linear regression, the basic model is additive; effects that depend on combinations of features are not captured unless you add interaction terms.
- Highly correlated features. Heavy multicollinearity makes individual coefficients unstable. Regularization (which scikit-learn applies by default) helps, but interpret coefficients cautiously.
Here is the nonlinear failure made visible. make_moons produces two
interleaving crescents that no straight line can separate.
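A small sketch of that experiment. The sample count, noise level, and random seeds are assumptions picked for illustration, so the exact score will vary with them.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two interleaving crescents with a little noise (assumed settings).
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A single straight cut through a curved problem.
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy on moons: {accuracy:.3f}")
```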
The accuracy is mediocre — not because the model is broken, but because the problem's true boundary is curved and logistic regression can only draw a straight one. That is the assumption talking. On data with a roughly linear boundary (like the breast cancer set earlier), the very same model shines.
Evaluation goes deeper than accuracy
Accuracy is a fine first headline, but for classification it can badly mislead — especially with imbalanced classes, where predicting the majority every time scores high while being useless. Precision, recall, the confusion matrix, and the ROC curve tell the real story, and the course's classification-evaluation material gives them full treatment. We are deliberately not unpacking them here so this page stays about the algorithm, not the scoring.
Common misconceptions
- "Logistic regression predicts a number, like linear regression." It predicts a class. It computes a number internally (the score z) and a probability, but its job is classification. This is the misconception to guard against hardest.
- "The probabilities come straight from a line." They come from a line passed through the sigmoid. The sigmoid is what keeps the output in (0, 1) and gives the S-shaped, saturating confidence.
- "The 0.5 threshold is a law of nature." It is just the default. You can and often should move it to trade false positives against false negatives, depending on which error is costlier.
- "It can learn any decision boundary." Its boundary is strictly straight (linear). Curved boundaries require engineered features or a different model.
- "Higher accuracy always means a better classifier." Not with imbalanced classes. A model that ignores the rare class can post high accuracy and be worthless — which is why the metrics pages exist.
Real-world applications
Logistic regression is, quietly, one of the most-deployed models anywhere, precisely because it is fast, interpretable, and outputs probabilities:
- Medicine. Estimating the probability a tumor is malignant, a patient will be readmitted, or a treatment will succeed — where clinicians need to see why, feature by feature.
- Finance. Credit scoring and fraud detection lean on it heavily; regulators often require models whose decisions can be explained, which rules out black boxes and favors logistic regression.
- Marketing and product. Predicting click-through, conversion, or churn, and ranking users by probability so effort goes where it pays.
- A baseline everywhere. Before any complex classifier ships, a logistic regression sets the bar. If the fancy model cannot beat it, the complexity is not earning its keep.
In every case the draw is the same trio: probabilities, interpretability, and speed — a straight-line decision you can read off the coefficients.
Your turn
Build a logistic regression on the breast cancer data and pull out both a class prediction and a probability.
- Load the data with load_breast_cancer(return_X_y=True) into X and y (0 = malignant, 1 = benign).
- Split into train/test with 20% in the test set, random_state=42, and stratified on y. Use X_train, X_test, y_train, y_test.
- Create a LogisticRegression(max_iter=1000) called model and .fit() it on the training data.
- Store its test accuracy in accuracy (via model.score(...)).
- For the first test tumor (X_test[:1]), store the probability that it is benign (class 1) in p_benign. Hint: it is column 1 of model.predict_proba(X_test[:1]).
The hidden tests check the split size, that model is fitted, that
accuracy is sensibly high, and that p_benign is a valid probability
between 0 and 1.
Check your understanding
Despite its name, what kind of task is logistic regression built for?
- Classification: predicting a category (and a probability), such as malignant versus benign
- Predicting a continuous number, like linear regression
- Clustering unlabeled data into groups
- Reducing the number of features in a dataset
What is the role of the sigmoid function in logistic regression?
- It removes outliers before fitting
- It squashes the unbounded linear score into a probability between 0 and 1
- It selects which features to keep
- It computes the accuracy of the model
For a binary logistic regression, how does model.predict(X) relate to model.predict_proba(X)?
- They are unrelated methods that can disagree
- predict ignores probabilities and uses the raw features
- predict applies a threshold (default 0.5) to the class-1 probability from predict_proba
- predict_proba rounds the output of predict
Why does logistic regression struggle on data shaped like two interleaving moons (make_moons)?
- The dataset is too small to fit
- Logistic regression cannot output probabilities
- make_moons produces continuous targets, not classes
- Its decision boundary is a straight line, but the true boundary between the crescents is curved
A feature has a large positive coefficient in a fitted logistic regression (target: class 1 = benign). What does that imply, intuitively?
- The feature is measured in the wrong units
- Increasing that feature lowers the probability of class 1
- Increasing that feature raises the linear score, pushing the predicted probability of class 1 (benign) upward
- The model has overfit the training data