ROC Curves and AUC

In the last chapter we saw that a classifier's precision and recall depend on where you put the decision threshold. That raises an awkward question: if the threshold changes everything, how do you judge the model itself, separate from any one operating point? The ROC curve answers by evaluating the model at every threshold simultaneously, and AUC summarizes the whole picture in one number.

Two rates that move with the threshold

ROC is built from two quantities. Both come straight from the confusion matrix, recomputed at each threshold:

\text{TPR} = \frac{TP}{TP + FN} \qquad \text{(this is exactly recall)}

\text{FPR} = \frac{FP}{FP + TN} \qquad \text{(fraction of negatives wrongly flagged)}

TPR (recall, sensitivity): of the real positives, what fraction did we catch? Higher is better.
FPR (1 − specificity): of the real negatives, what fraction did we falsely flag? Lower is better.

As you lower the threshold, the model says "positive" more often, so both rates rise together — you catch more real positives (good) but also raise more false alarms (bad). The ROC curve traces this tradeoff.
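To make the tradeoff concrete, here is a minimal sketch that recomputes both rates by hand at a few thresholds. The labels, scores, and the helper name rates_at are made up for illustration; only NumPy is assumed.

```python
import numpy as np

# Hypothetical toy data: true labels and model scores, invented for this sketch.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])


def rates_at(threshold, y_true, scores):
    """Return (TPR, FPR) when every score >= threshold is predicted positive."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fn = np.sum(~y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    tn = np.sum(~y_pred & (y_true == 0))
    return float(tp / (tp + fn)), float(fp / (fp + tn))


# Sweep from strict to lenient and watch both rates climb.
thresholds = [0.95, 0.75, 0.50, 0.25, 0.05]
rates = [rates_at(t, y_true, scores) for t in thresholds]
for t, (tpr, fpr) in zip(thresholds, rates):
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Neither rate ever decreases as the threshold drops, which is why the resulting curve only moves up and to the right.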

The ROC curve

An ROC curve plots TPR (vertical) against FPR (horizontal) as the threshold sweeps from very strict to very lenient. Each threshold is one point on the curve.

Let us draw a real one. roc_curve returns the FPR and TPR at every useful threshold; roc_auc_score returns the area underneath.
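A minimal sketch of those two calls, assuming scikit-learn's built-in breast cancer dataset and a scaled logistic regression as the model. Both are illustrative choices, as are the split size and random seed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed setup: dataset, split, and model are placeholders for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# ROC needs scores, so take the positive-class probability, not predict().
proba = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

print(f"AUC = {auc:.3f} from {len(thresholds)} curve points")
for f, t, th in list(zip(fpr, tpr, thresholds))[:5]:
    print(f"threshold={th:.3f}  FPR={f:.3f}  TPR={t:.3f}")
```

Passing fpr and tpr to any line-plotting call (for example matplotlib's plt.plot(fpr, tpr)) draws the curve itself.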

How to read the picture:

The bottom-left corner (0, 0) is the strictest threshold: predict positive for no one. No false alarms, but no catches either.
The top-right corner (1, 1) is the most lenient: predict positive for everyone. Every real positive caught, but every negative falsely flagged too.
The curve connects them. A model that hugs the top-left corner is excellent — it achieves high TPR while keeping FPR low. A model that lies along the diagonal is no better than flipping a coin.

Why ROC needs probabilities, not classes

You cannot draw an ROC curve from a single set of 0/1 predictions, because those have already baked in one threshold. ROC needs the model's underlying scores (predict_proba or decision_function) so it can re-threshold them at many cutoffs. If a model only gives hard labels, it yields a single point, not a curve.
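A small sketch of the difference, using made-up labels and scores and a hypothetical helper roc_point. Re-thresholding the scores produces several distinct points, while re-thresholding the hard labels keeps landing on the same one.

```python
import numpy as np

# Hypothetical toy data, invented for this sketch.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

# Hard labels: the scores after a 0.5 threshold has already been applied.
hard = (scores >= 0.5).astype(int)


def roc_point(y_true, values, cutoff):
    """Return (TPR, FPR) when values >= cutoff are predicted positive."""
    pred = values >= cutoff
    tpr = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)
    return float(tpr), float(fpr)


cutoffs = [0.2, 0.4, 0.6, 0.8]
from_scores = {roc_point(y_true, scores, c) for c in cutoffs}
from_labels = {roc_point(y_true, hard, c) for c in cutoffs}

print("distinct points from scores:", len(from_scores))
print("distinct points from labels:", len(from_labels))
```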

AUC: the area under the curve

The AUC (Area Under the ROC Curve) collapses the whole curve into one number between 0 and 1:

AUC = 1.0: a perfect ranker. There exists a threshold that separates all positives from all negatives.
AUC = 0.5: the diagonal. Useless, no better than random.
AUC < 0.5: worse than random, which usually means the scores are inverted. Flipping them gives an AUC of 1 − AUC, so a model far below 0.5 becomes a good one.

The interpretation that actually sticks

AUC has a beautifully concrete meaning that has nothing to do with area:

AUC is the probability that the model gives a randomly chosen positive example a higher score than a randomly chosen negative example.

In other words, AUC measures ranking quality (how well the model sorts positives above negatives), independent of any threshold. Let us verify this equivalence by brute force.
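A minimal sketch of that brute-force check, assuming synthetic scores drawn from two overlapping normal distributions (the data and seed are made up). It compares every positive with every negative, counts ties as half a win, and sets the result beside roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic data (an assumption): positives score higher on average.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = rng.normal(loc=y_true * 1.0, scale=1.0)

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Every (positive, negative) pair: does the positive get the higher score?
diff = pos[:, None] - neg[None, :]
pairwise = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

auc = roc_auc_score(y_true, scores)

print(f"pairwise win rate = {pairwise:.6f}")
print(f"roc_auc_score     = {auc:.6f}")
```

The pairwise loop is quadratic in the number of examples, so it is useful as a sanity check rather than as a way to compute AUC on large data.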

The two numbers are identical. This is why AUC is so popular for comparing models: it asks "how well does this model rank cases?" without committing to a threshold, which is exactly the model-level question we wanted.

Comparing models with one plot

Because AUC is threshold-free, it is a clean way to rank models. Here a genuine model is compared against a deliberately weak one.
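A sketch of such a comparison, assuming the breast cancer dataset, a scaled logistic regression as the genuine model, and a one-split decision tree as the deliberately weak one. These model choices are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed models: "strong" uses all features, "weak" is a single-split stump.
models = {
    "strong": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "weak": DecisionTreeClassifier(max_depth=1, random_state=0),
}

curves, aucs = {}, {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    curves[name] = (fpr, tpr)  # keep the points so both curves can share one plot
    aucs[name] = roc_auc_score(y_test, proba)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

Plotting each entry of curves on the same axes, with the diagonal as a reference line, gives the one-plot comparison.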

The stronger model's curve sits above and to the left of the weak one, and its AUC is higher. When one curve lies entirely above another, that model dominates: at every false positive rate it achieves at least as high a true positive rate.

What AUC does not tell you

AUC is powerful but it is not the whole story, and treating it as a single verdict causes real mistakes:

It says nothing about your specific threshold. A great AUC means a good threshold exists; it does not tell you the precision and recall at the cutoff you will actually deploy. You still have to choose and report an operating point.
It ignores calibration. AUC only cares about the order of scores, not whether a "0.9" really means a 90% chance. A model can have perfect AUC and wildly miscalibrated probabilities.
It can flatter a model on heavy class imbalance. Because FPR has the large negative class in its denominator, a flood of false positives barely moves it. AUC can look excellent while precision is poor.
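The calibration point is easy to check. The sketch below, on synthetic data with a made-up distortion, pushes probabilities through a strictly increasing transform. The order of the scores is unchanged, so AUC is unchanged, even though the transformed values no longer resemble probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and probabilities (assumptions for illustration).
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
logits = rng.normal(loc=2.0 * y_true - 1.0, scale=1.5)
proba = 1.0 / (1.0 + np.exp(-logits))

# A strictly increasing distortion: same ranking, very different values.
distorted = proba ** 10

auc_original = roc_auc_score(y_true, proba)
auc_distorted = roc_auc_score(y_true, distorted)

print(f"AUC original  = {auc_original:.6f}")
print(f"AUC distorted = {auc_distorted:.6f}")
print(f"mean predicted probability: {proba.mean():.3f} vs {distorted.mean():.3f}")
print(f"actual positive rate:       {y_true.mean():.3f}")
```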

On heavy imbalance, prefer the precision–recall curve

When the positive class is rare and precision is what you care about, ROC AUC can be misleadingly rosy. The precision–recall curve and its area (average precision) focus on the positive class and give a more honest picture. Reach for it whenever positives are scarce and false positives matter.
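A sketch of the gap on synthetic data with 1% positives. The score distributions and the cutoff of 2.0 are assumptions chosen for illustration, so the exact figures are not meaningful, only the contrast between the two summaries.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Synthetic rare-positive problem (an assumption): 1% positives.
rng = np.random.default_rng(0)
n_neg, n_pos = 49_500, 500
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
scores = np.concatenate(
    [rng.normal(0.0, 1.0, n_neg), rng.normal(2.0, 1.0, n_pos)]
)

roc_auc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)

# Precision and recall at one hypothetical operating point.
y_pred = (scores >= 2.0).astype(int)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"ROC AUC           = {roc_auc:.3f}")
print(f"average precision = {avg_precision:.3f}")
print(f"at cutoff 2.0: precision = {precision:.3f}, recall = {recall:.3f}")
```

A small false positive rate still means many false alarms when negatives outnumber positives 99 to 1, which is what drags precision down.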

Common misconceptions

"AUC of 0.9 means 90% accuracy." No. AUC is the probability of ranking a random positive above a random negative: a ranking measure, not an accuracy rate. A model can have AUC 0.9 and mediocre accuracy at a 0.5 threshold.
"A high AUC means the model is good at my operating point." AUC aggregates over all thresholds. You must still inspect precision/recall at the threshold you deploy.
"AUC below 0.5 is a bug." Not quite. An AUC well below 0.5 usually means the scores are inverted: the model learned a real pattern with the sign flipped, and negating the scores gives an AUC of 1 − AUC.
"ROC AUC is always the right metric." On rare-positive problems it can hide a poor precision; the precision–recall curve is often more informative there.

Real-world applications

AUC is the lingua franca for comparing classifiers in credit scoring, medical diagnosis, churn prediction, and information retrieval — anywhere you rank cases by risk or relevance and then choose a cutoff later. A credit team might compare ten models by AUC to pick the best ranker, then separately choose the approval threshold that hits their target default rate. The two decisions — which model, which threshold — are exactly the two things ROC analysis keeps cleanly apart.
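The second decision, choosing a threshold once the model is picked, can be sketched from the output of roc_curve. This example uses synthetic scores and a false positive rate budget as a stand-in for a business target. The budget of 0.05 and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic scores (an assumption): positives score higher on average.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=2000)
scores = rng.normal(loc=1.5 * y_true, scale=1.0)

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Hypothetical budget: flag at most 5% of the true negatives.
target_fpr = 0.05

# fpr is non-decreasing, so the last index within budget is the most
# lenient threshold that still respects it (and has the highest TPR).
best = np.flatnonzero(fpr <= target_fpr)[-1]
chosen_threshold = thresholds[best]

# Check the realized rates when scores >= threshold are flagged positive.
flagged = scores >= chosen_threshold
realized_fpr = flagged[y_true == 0].mean()
realized_tpr = flagged[y_true == 1].mean()

print(f"threshold = {chosen_threshold:.3f}")
print(f"FPR = {realized_fpr:.3f}, TPR = {realized_tpr:.3f}")
```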

Your turn

Two models are trained on the breast cancer data. Your job is to score them by AUC on the test set.

For each fitted model, get the positive-class probabilities with model.predict_proba(X_test)[:, 1].
Compute auc_logreg and auc_tree with roc_auc_score.
Set better to the string "logreg" or "tree" — whichever has the higher AUC.

The tests confirm both AUCs are valid probabilities (between 0 and 1) and that better correctly names the higher-AUC model.
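If you want to try the exercise outside the course environment, here is a possible setup. It assumes scikit-learn's load_breast_cancer, a scaled logistic regression, and a depth-limited decision tree. The exact models, split, and seed are guesses, not the course's own starter code, and the variable names logreg and tree are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: the split and hyperparameters are illustrative.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# From here, follow the three steps above using X_test and y_test.
print("models fitted; test set size:", len(y_test))
```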

Check your understanding

Question (select one)

What does an ROC curve plot?

Precision against recall

Accuracy against threshold

True Positive Rate against False Positive Rate as the decision threshold varies

Training error against test error

Question (select one)

Which interpretation of AUC is correct?

The percentage of predictions that are correct

The model's accuracy at threshold 0.5

The probability that the model scores a randomly chosen positive higher than a randomly chosen negative

The fraction of variance explained

Question (select one)

A model has an ROC AUC of exactly 0.5. What does this indicate?

Perfect classification

The threshold is set too high

The model ranks positives and negatives no better than random guessing

The classes are perfectly balanced

Question (select one)

Why can you not compute an ROC curve from a model's hard 0/1 predictions alone?

ROC curves require regression outputs

Hard predictions have already fixed a single threshold; ROC needs the underlying scores or probabilities to re-threshold at many cutoffs

ROC curves only work for balanced data

scikit-learn forbids it

Question (select one)

On a dataset where only 1% of cases are positive, a model shows ROC AUC = 0.95 but its precision is poor. What is the most likely explanation?

The AUC must be a bug

With a huge negative class, many false positives barely move the FPR, so ROC AUC can look excellent while precision (which depends on FP relative to predicted positives) stays low

High AUC guarantees high precision

Precision and AUC always agree

Question (select one)

A teammate reports "AUC = 0.92, so we're done." What important thing does AUC not tell them?

How well the model ranks cases

Whether the model beats random

The precision and recall at the specific threshold they will actually deploy — AUC aggregates over all thresholds and ignores any single operating point

Whether one model outranks another
