ROC Curves and AUC
Precision and recall judge a classifier at one threshold. ROC curves judge it at every threshold at once — and AUC boils that into a single, threshold-free score.
In the last chapter we saw that a classifier's precision and recall depend on where you put the decision threshold. That raises an awkward question: if the threshold changes everything, how do you judge the model itself, separate from any one operating point? The ROC curve answers by evaluating the model at every threshold simultaneously, and AUC summarizes the whole picture in one number.
Two rates that move with the threshold
ROC is built from two quantities. Both come straight from the confusion matrix, recomputed at each threshold:
- TPR (recall, sensitivity): of the real positives, what fraction did we catch? Higher is better.
- FPR (1 − specificity): of the real negatives, what fraction did we falsely flag? Lower is better.
As you lower the threshold, the model says "positive" more often, so both rates rise together — you catch more real positives (good) but also raise more false alarms (bad). The ROC curve traces this tradeoff.
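To make the two rates concrete, here is a minimal sketch that recomputes them at a handful of thresholds. The labels, the scores, and the helper name rates_at are invented for illustration, and the sketch assumes a score at or above the threshold counts as a positive prediction.

```python
# Toy data (invented for illustration): 1 = real positive, 0 = real negative.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]


def rates_at(threshold):
    """Return (TPR, FPR) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg


# Sweep from strict to lenient: each rate either stays flat or rises.
for threshold in [1.0, 0.75, 0.5, 0.25, 0.0]:
    tpr, fpr = rates_at(threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Each printed row is one (FPR, TPR) pair, which is exactly one point on the curve introduced next.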
The ROC curve
An ROC curve plots TPR (vertical) against FPR (horizontal) as the threshold sweeps from very strict to very lenient. Each threshold is one point on the curve.
Let us draw a real one. roc_curve returns the FPR and TPR at every useful
threshold; roc_auc_score returns the area underneath.
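A sketch of what such a plot could look like, assuming the breast cancer dataset used in the exercise at the end of this chapter and a scaled logistic regression as the model. The split, the model choice, and the output filename roc_curve.png are assumptions made for this example, not settings taken from the original.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed model for the illustration: scaling + logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# ROC needs scores, so take the probability of the positive class
# (label 1, which is "benign" in this dataset).
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"logistic regression (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC curve")
ax.legend(loc="lower right")
fig.savefig("roc_curve.png")  # in an interactive session, plt.show() works too
```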
How to read the picture:
- The bottom-left corner (0, 0) is the strictest threshold: predict positive for no one. No false alarms, but no catches either.
- The top-right corner (1, 1) is the most lenient: predict positive for everyone. Every real positive caught, but every negative falsely flagged too.
- The curve connects them. A model that hugs the top-left corner is excellent: it achieves high TPR while keeping FPR low. A model that lies along the diagonal is no better than flipping a coin.
Why ROC needs probabilities, not classes
You cannot draw an ROC curve from a single set of 0/1 predictions, because those have already baked in one threshold. ROC needs the model's underlying scores (predict_proba or decision_function) so it can re-threshold them at many cutoffs. If a model only gives hard labels, it gets a single point, not a curve.
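The difference is easy to see by handing roc_curve both kinds of input. This is a small sketch on invented labels and scores; the hard labels are simply the scores cut at 0.5, and drop_intermediate=False is passed only so that every threshold is kept and the point counts are easy to compare.

```python
from sklearn.metrics import roc_curve

# Invented data for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

fpr_s, tpr_s, _ = roc_curve(y_true, scores, drop_intermediate=False)
fpr_h, tpr_h, _ = roc_curve(y_true, hard_labels, drop_intermediate=False)

print("points from scores:     ", len(fpr_s))
print("points from hard labels:", len(fpr_h))
# Besides the two corners, hard labels give one point: (FPR, TPR) at the baked-in cutoff.
print("the single informative point:", (fpr_h[1], tpr_h[1]))
```

With hard labels, everything between the corners and that one point is a straight line, so the "curve" carries no more information than a single confusion matrix.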
AUC: the area under the curve
The AUC (Area Under the ROC Curve) collapses the whole curve into one number between 0 and 1:
- AUC = 1.0 — a perfect ranker: there exists a threshold that separates all positives from all negatives.
- AUC = 0.5 — the diagonal: useless, no better than random.
- AUC < 0.5 — worse than random, which usually means the scores are inverted (flip them and you have a good model).
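The last case can be checked on a toy example. The sketch below uses invented labels and scores, simulates an inverted model by replacing each score s with 1 - s, and then flips the scores back.

```python
from sklearn.metrics import roc_auc_score

# Invented data for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

# Simulate an inverted model: a high score now points toward "negative".
inverted = [1.0 - s for s in scores]

auc_inverted = roc_auc_score(y_true, inverted)
auc_flipped_back = roc_auc_score(y_true, [1.0 - s for s in inverted])

print("AUC of inverted scores:", auc_inverted)
print("AUC after flipping:    ", auc_flipped_back)
```

The two values sum to 1, because flipping the scores reverses every positive-versus-negative comparison.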
The interpretation that actually sticks
AUC has a beautifully concrete meaning that has nothing to do with area:
AUC is the probability that the model gives a randomly chosen positive example a higher score than a randomly chosen negative example.
In other words, AUC measures ranking quality (how well the model sorts positives above negatives), independent of any threshold. Let us check this equivalence by brute force.
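One way to run that check, sketched on synthetic data: compare every positive score with every negative score, count the fraction of pairs the positive wins (ties counted as half), and set the result next to roc_auc_score. The random labels and scores below are assumptions for the demonstration, not output from a real model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
# Synthetic scores: positives are shifted upward, with plenty of overlap.
scores = 0.8 * y + rng.normal(size=500)

pos = scores[y == 1]
neg = scores[y == 0]

# Every positive against every negative; a tie counts as half a win.
diff = pos[:, None] - neg[None, :]
auc_pairs = (diff > 0).mean() + 0.5 * (diff == 0).mean()

auc_area = roc_auc_score(y, scores)

print(f"pairwise probability: {auc_pairs:.6f}")
print(f"roc_auc_score:        {auc_area:.6f}")
```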
The two numbers are identical. This is why AUC is so popular for comparing models: it asks "how well does this model rank cases?" without committing to a threshold, which is exactly the model-level question we wanted.
Comparing models with one plot
Because AUC is threshold-free, it is a clean way to rank models. Here a genuine model is compared against a deliberately weak one.
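A possible version of that comparison, assuming the breast cancer dataset, a scaled logistic regression as the genuine model, and a one-split decision tree as the deliberately weak one. Both model choices are assumptions for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Deliberately weak: a tree that is allowed a single split.
    "weak stump": DecisionTreeClassifier(max_depth=1, random_state=0),
}

aucs = {}
curves = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    aucs[name] = roc_auc_score(y_test, scores)
    # Keep the curve so both models can be drawn on the same axes.
    curves[name] = roc_curve(y_test, scores)[:2]
    print(f"{name:20s} AUC = {aucs[name]:.3f}")
```

Each entry in curves is an (fpr, tpr) pair that can be passed to a plotting library to overlay the two curves.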
The stronger model's curve sits above and to the left of the weak one, and its AUC is higher. When one curve lies entirely above another, that model achieves a higher TPR at every FPR, so it is the better choice at any operating point.
What AUC does not tell you
AUC is powerful but it is not the whole story, and treating it as a single verdict causes real mistakes:
- It says nothing about your specific threshold. A great AUC means a good threshold exists; it does not tell you the precision and recall at the cutoff you will actually deploy. You still have to choose and report an operating point.
- It ignores calibration. AUC only cares about the order of scores, not whether a "0.9" really means a 90% chance. A model can have perfect AUC and wildly miscalibrated probabilities.
- It can flatter a model on heavy class imbalance. Because FPR has the large negative class in its denominator, a flood of false positives barely moves it. AUC can look excellent while precision is poor.
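The calibration point can be illustrated with a monotonic distortion. In this sketch, synthetic probabilities are cubed: the order of the scores is unchanged, so the AUC stays the same, but the values no longer behave like probabilities. The data and the choice of cubing are assumptions for the demonstration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
# Hypothetical probabilities in (0, 1), loosely related to the label.
p = 1.0 / (1.0 + np.exp(-(1.5 * y - 0.75 + rng.normal(size=500))))

# Cubing is strictly increasing on (0, 1): it preserves the ranking
# but drags every value toward 0.
p_distorted = p ** 3

auc_original = roc_auc_score(y, p)
auc_distorted = roc_auc_score(y, p_distorted)

print(f"AUC original:  {auc_original:.4f}")
print(f"AUC distorted: {auc_distorted:.4f}")
print(f"mean score original:  {p.mean():.3f}")
print(f"mean score distorted: {p_distorted.mean():.3f}")
```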
On heavy imbalance, prefer the precision–recall curve
When the positive class is rare and precision is what you care about, ROC AUC can be misleadingly rosy. The precision–recall curve and its area (average precision) focus on the positive class and give a more honest picture. Reach for it whenever positives are scarce and false positives matter.
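The gap between the two views shows up clearly on a constructed example. The sketch below builds 990 negatives and 10 positives by hand (not from a real model), with the positives scoring above most negatives but below a small group of them, then compares ROC AUC with average_precision_score and inspects the cutoff that catches every positive.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Constructed data: 990 negatives, 10 positives (1% positive).
neg_scores = [i / 1000 for i in range(990)]             # 0.000 ... 0.989
pos_scores = [0.9505 + k * 0.00001 for k in range(10)]  # just above 0.950

y_true = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

roc_auc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)

# Operating point that catches every positive: flag everything >= 0.9505.
threshold = 0.9505
fp = sum(1 for s in neg_scores if s >= threshold)
tp = sum(1 for s in pos_scores if s >= threshold)
fpr = fp / len(neg_scores)
precision = tp / (tp + fp)

print(f"ROC AUC:           {roc_auc:.3f}")
print(f"average precision: {avg_precision:.3f}")
print(f"at threshold {threshold}: FP={fp}, TP={tp}, FPR={fpr:.3f}, precision={precision:.3f}")
```

A few dozen false positives are a tiny share of 990 negatives, so FPR and ROC AUC barely notice them, yet they outnumber the true positives and pull precision down.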
Common misconceptions
- "AUC of 0.9 means 90% accuracy." No. AUC is the probability of ranking a random positive above a random negative, which makes it a ranking measure, not an accuracy rate. A model can have AUC 0.9 and mediocre accuracy at a threshold of 0.5.
- "A high AUC means the model is good at my operating point." AUC aggregates over all thresholds. You must still inspect precision/recall at the threshold you deploy.
- "AUC below 0.5 is a bug." It usually means the scores are inverted — the model learned the right pattern with the sign flipped.
- "ROC AUC is always the right metric." On rare-positive problems it can hide a poor precision; the precision–recall curve is often more informative there.
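The first two misconceptions can be checked with a small constructed case: scores that rank perfectly but all sit below 0.5. The labels, scores, and the alternative cutoff of 0.125 are invented for this sketch.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented scores: perfectly ordered, but every one is below 0.5.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.48]

auc = roc_auc_score(y_true, scores)

# At the default 0.5 cutoff, nothing is flagged as positive.
labels_at_half = [1 if s >= 0.5 else 0 for s in scores]
accuracy_at_half = accuracy_score(y_true, labels_at_half)

# A lower cutoff recovers the accuracy that the ranking makes possible.
labels_at_low = [1 if s >= 0.125 else 0 for s in scores]
accuracy_at_low = accuracy_score(y_true, labels_at_low)

print("AUC:", auc)
print("accuracy at 0.5:  ", accuracy_at_half)
print("accuracy at 0.125:", accuracy_at_low)
```

The ranking is the same in both cases; only the operating point changed.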
Real-world applications
AUC is the lingua franca for comparing classifiers in credit scoring, medical diagnosis, churn prediction, and information retrieval — anywhere you rank cases by risk or relevance and then choose a cutoff later. A credit team might compare ten models by AUC to pick the best ranker, then separately choose the approval threshold that hits their target default rate. The two decisions — which model, which threshold — are exactly the two things ROC analysis keeps cleanly apart.
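The second decision, picking the cutoff, can be sketched from the same roc_curve output. The example below uses synthetic scores and a hypothetical constraint, a maximum FPR of 0.05, as a stand-in for whatever business target applies; a real credit team would substitute its own criterion.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
# Synthetic stand-in for the scores of the model chosen by AUC.
scores = 1.5 * y + rng.normal(size=500)

max_fpr = 0.05  # hypothetical constraint, chosen for illustration

fpr, tpr, thresholds = roc_curve(y, scores, drop_intermediate=False)

# TPR never decreases along the curve, so the last point that still
# meets the FPR budget is the one with the highest TPR.
idx = np.flatnonzero(fpr <= max_fpr)[-1]
chosen_threshold = thresholds[idx]

print(f"threshold={chosen_threshold:.3f}  FPR={fpr[idx]:.3f}  TPR={tpr[idx]:.3f}")
```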
Your turn
Two models are trained on the breast cancer data. Your job is to score them by AUC on the test set.
- For each fitted model, get the positive-class probabilities with model.predict_proba(X_test)[:, 1].
- Compute auc_logreg and auc_tree with roc_auc_score.
- Set better to the string "logreg" or "tree", whichever has the higher AUC.
The tests confirm both AUCs are valid probabilities (between 0 and 1) and that better correctly names the higher-AUC model.
Check your understanding
What does an ROC curve plot?
Precision against recall
Accuracy against threshold
True Positive Rate against False Positive Rate as the decision threshold varies
Training error against test error
Which interpretation of AUC is correct?
The percentage of predictions that are correct
The model's accuracy at threshold 0.5
The probability that the model scores a randomly chosen positive higher than a randomly chosen negative
The fraction of variance explained
A model has an ROC AUC of exactly 0.5. What does this indicate?
Perfect classification
The threshold is set too high
The model ranks positives and negatives no better than random guessing
The classes are perfectly balanced
Why can you not compute an ROC curve from a model's hard 0/1 predictions alone?
ROC curves require regression outputs
Hard predictions have already fixed a single threshold; ROC needs the underlying scores or probabilities to re-threshold at many cutoffs
ROC curves only work for balanced data
scikit-learn forbids it
On a dataset where only 1% of cases are positive, a model shows ROC AUC = 0.95 but its precision is poor. What is the most likely explanation?
The AUC must be a bug
With a huge negative class, many false positives barely move the FPR, so ROC AUC can look excellent while precision (which depends on FP relative to predicted positives) stays low
High AUC guarantees high precision
Precision and AUC always agree
A teammate reports "AUC = 0.92, so we're done." What important thing does AUC not tell them?
How well the model ranks cases
Whether the model beats random
The precision and recall at the specific threshold they will actually deploy — AUC aggregates over all thresholds and ignores any single operating point
Whether one model outranks another