Classification Metrics — Beyond Accuracy
Accuracy is the metric everyone reaches for first and the one that misleads most often. The confusion matrix, precision, recall, and F1 tell the real story.
When a model predicts categories, the obvious question is "how often is it right?" — its accuracy. Accuracy is intuitive, easy to compute, and genuinely useful sometimes. It is also responsible for more bad models shipped with confidence than any other number in machine learning. This chapter explains why, and gives you the tools that do not lie: the confusion matrix, precision, recall, and F1.
The accuracy trap
Imagine a disease that affects 2% of patients. Build a "model" that ignores its inputs and predicts "healthy" for everyone. Its accuracy? 98%. It is also completely worthless — it never catches a single sick patient. Let us watch this happen for real.
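A minimal sketch of that experiment, assuming synthetic data from make_classification with roughly 5% positives and scikit-learn's DummyClassifier standing in for the do-nothing "model". The dataset, sizes, and random seed are illustrative assumptions, not real patient data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): roughly 5% of cases are positive.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A "model" that ignores its inputs and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
positives_caught = int(((y_pred == 1) & (y_test == 1)).sum())
recall = recall_score(y_test, y_pred)

print(f"accuracy:         {accuracy:.3f}")
print(f"positives caught: {positives_caught}")
print(f"recall:           {recall:.3f}")
```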
Over 90% accuracy, and it caught zero positive cases. The accuracy is high only because the negative class is huge and easy. On any problem where the rare class is the one you care about — fraud, disease, defaults, defects — accuracy alone is dangerously misleading.
Accuracy hides what matters on imbalanced data
On imbalanced problems, a useless majority-class predictor can post a gorgeous accuracy while completely failing at the actual task. Whenever the classes are imbalanced, treat raw accuracy with deep suspicion and look at per-class metrics instead.
The confusion matrix: where every metric is born
To talk precisely about classifier mistakes, we first need names for the four things that can happen on a binary problem. Compare each prediction to the truth:
- True Positive (TP) — predicted positive, and it was. A correct catch.
- False Positive (FP) — predicted positive, but it was negative. A false alarm. (Statisticians call this a Type I error.)
- False Negative (FN) — predicted negative, but it was positive. A missed case. (A Type II error.)
- True Negative (TN) — predicted negative, and it was. A correct pass.
The confusion matrix simply counts these four outcomes. scikit-learn lays it out with actual classes as rows and predicted classes as columns.
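As a small sketch, here is the matrix for ten hand-made labels (illustrative values, not the output of a real model). The call to ravel() flattens the matrix row by row, which for labels [0, 1] yields the four counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example: 6 actual negatives, 4 actual positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Flatten row by row: [[TN, FP], [FN, TP]] -> TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```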
Read the corners carefully
With the default labels [0, 1], the bottom-right cell is TP (actual 1,
predicted 1) and the top-left is TN. The off-diagonal cells are the
mistakes: top-right is FP, bottom-left is FN. The whole point of the matrix
is that it separates the two kinds of mistake, which a single accuracy
number blends together.
Precision: when false alarms are expensive
Precision answers: of all the cases the model flagged as positive, what fraction really were? In confusion-matrix terms, precision = TP / (TP + FP).
- What it measures. The trustworthiness of a positive prediction. High precision means "when this model says positive, believe it."
- What it does not measure. How many positives you missed. A model can have perfect precision by making exactly one extremely confident positive prediction and ignoring every other real case.
- When it matters most. When false positives are costly. A spam filter that sends a real, important email to the junk folder (FP) has failed badly — so spam filters prize precision. Likewise a model that flags customers for an expensive intervention.
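To see the blind spot concretely, here is a sketch with made-up labels: a very cautious model flags exactly one case, gets it right, and misses the other four positives. Precision is computed by hand from the confusion-matrix counts and checked against scikit-learn's precision_score.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels: 5 real positives among 10 cases.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# A cautious model that flags exactly one case, and gets it right.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision_by_hand = tp / (tp + fp)

precision_sklearn = precision_score(y_true, y_pred)
recall_sklearn = recall_score(y_true, y_pred)

print(f"precision by hand: {precision_by_hand:.2f}")
print(f"precision_score:   {precision_sklearn:.2f}")
print(f"recall_score:      {recall_sklearn:.2f}")
```

Precision comes out perfect while recall shows that four of the five real positives were missed, which is why precision is never read alone.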
Recall: when misses are expensive
Recall (also called sensitivity or the true positive rate) answers: of all the cases that were actually positive, what fraction did the model catch? In confusion-matrix terms, recall = TP / (TP + FN).
- What it measures. Coverage of the real positives. High recall means "few positives slip through."
- What it does not measure. How many false alarms you raised getting there. A model can reach perfect recall by predicting positive for everyone — catching all real cases and drowning in false positives.
- When it matters most. When false negatives are costly. A cancer screen that misses a real tumor (FN) can cost a life, so screening prizes recall even at the price of more false alarms (which a follow-up test can rule out).
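The mirror-image failure can be sketched the same way. With made-up labels and a "model" that flags everyone as positive, recall computed by hand from the four counts is perfect, while precision collapses to the base rate of the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels: 2 real positives among 10 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A model that flags every single case as positive.
y_pred = [1] * 10

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_by_hand = tp / (tp + fn)

recall_sklearn = recall_score(y_true, y_pred)
precision_sklearn = precision_score(y_true, y_pred)

print(f"recall by hand:  {recall_by_hand:.2f}")
print(f"recall_score:    {recall_sklearn:.2f}")
print(f"precision_score: {precision_sklearn:.2f}")
```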
Precision and recall pull in opposite directions
You can almost always raise one by sacrificing the other. Flag more cases as positive and recall rises while precision falls; flag fewer and the reverse happens. A single number cannot capture a classifier — you need at least this pair, chosen around which mistake hurts more.
The threshold is the dial
Most classifiers do not really output a class. They output a probability (in scikit-learn, via predict_proba), and a threshold (0.5 by default) turns it into a class. Moving that threshold is exactly how you trade precision against recall.
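A sketch of a threshold sweep, assuming a LogisticRegression fit on synthetic imbalanced data (the dataset, the five thresholds, and the seed are illustrative choices). Each row of the printed table applies one threshold to the predicted probability of class 1 and reports the resulting precision and recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): roughly 10% positives.
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

rows = []
print("threshold  precision  recall")
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # Predict positive only when the probability clears the threshold.
    y_pred = (proba >= threshold).astype(int)
    # zero_division=0 avoids a warning if nothing is flagged at all.
    p = precision_score(y_test, y_pred, zero_division=0)
    r = recall_score(y_test, y_pred)
    rows.append((threshold, p, r))
    print(f"{threshold:9.1f}  {p:9.3f}  {r:6.3f}")
```

Recall can only stay flat or fall as the threshold rises, because the set of flagged cases shrinks. Precision usually rises, though on a finite test set it may wobble from row to row.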
Read the table top to bottom: as the threshold rises, the model demands more confidence before saying "positive," so recall falls (it can never rise) and precision typically climbs. There is no universally correct threshold; it depends on whether a false alarm or a missed case is worse for your problem. The default 0.5 is just a convention, not a law.
F1: one number when you must have one
Sometimes you genuinely need a single score, for example to rank models or to early-stop a search. Averaging precision and recall with a plain mean is too forgiving (a model with precision 1.0 and recall 0.0 would score 0.5). The F1 score uses the harmonic mean, F1 = 2 × precision × recall / (precision + recall), which stays low unless both are decent.
- What it measures. A balance of precision and recall that punishes lopsidedness. F1 is high only when neither is being sacrificed.
- What it does not measure. The true negatives, and your actual relative cost of FP versus FN — F1 weights precision and recall equally, which may not match your real priorities.
- When it can mislead. If false positives and false negatives have very different costs, the equal weighting in F1 is wrong for you, and you should optimize precision or recall directly (or a weighted F-beta score).
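A small sketch of how the harmonic mean punishes lopsidedness, using made-up labels where precision is perfect but recall is poor. It compares the plain mean with F1 computed by hand and with scikit-learn's f1_score, then shows fbeta_score, where beta above 1 leans toward recall and beta below 1 leans toward precision.

```python
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

# Illustrative labels: 4 real positives, and the model catches only one.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 1 of 1 flagged cases is correct
r = recall_score(y_true, y_pred)     # 1 of 4 real positives is caught

plain_mean = (p + r) / 2
f1_by_hand = 2 * p * r / (p + r)     # harmonic mean of precision and recall
f1_sklearn = f1_score(y_true, y_pred)

print(f"precision={p:.2f} recall={r:.2f}")
print(f"plain mean: {plain_mean:.3f}")
print(f"F1 by hand: {f1_by_hand:.3f}")
print(f"f1_score:   {f1_sklearn:.3f}")

# F-beta reweights the pair: beta=2 favors recall, beta=0.5 favors precision.
f2 = fbeta_score(y_true, y_pred, beta=2)
f_half = fbeta_score(y_true, y_pred, beta=0.5)
print(f"F2:   {f2:.3f}")
print(f"F0.5: {f_half:.3f}")
```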
classification_report: the whole picture at once
In practice you rarely compute these by hand. classification_report
prints precision, recall, and F1 for every class, plus support (how many
true examples of each class there were).
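A sketch with twenty hand-made labels (illustrative, not from a real model). The target_names argument labels the two rows "negative" and "positive", and output_dict=True returns the same numbers as a nested dictionary for programmatic use.

```python
from sklearn.metrics import classification_report

# Illustrative labels: 16 negatives, 4 positives.
y_true = [0] * 16 + [1] * 4
# 15 correct negatives, 1 false alarm, then 1 correct catch and 3 misses.
y_pred = [0] * 15 + [1] + [1] + [0] * 3

report = classification_report(
    y_true, y_pred, target_names=["negative", "positive"]
)
print(report)

# The same numbers as a nested dict, handy for programmatic checks.
report_dict = classification_report(
    y_true, y_pred, target_names=["negative", "positive"], output_dict=True
)
print(report_dict["positive"])
```

In this toy example accuracy is 0.80, yet the positive row shows precision 0.50 and recall 0.25.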
Look at the positive row specifically — its precision and recall are the
honest report card on the rare, important class, the one accuracy was
hiding. The macro avg averages each metric across classes treating them
equally (good for imbalance); the weighted avg weights by support (closer
to overall accuracy).
Multi-class: the same ideas, averaged
For more than two classes, precision, recall, and F1 are computed per class
(one-vs-rest) and then averaged. macro treats every class equally;
weighted accounts for class sizes; micro pools all decisions together.
On imbalanced multi-class problems, prefer macro so a tiny class is not
ignored.
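A sketch of the three averaging modes on a made-up three-class problem where the smallest class is never predicted. Recall is used for the demonstration; precision_score and f1_score accept the same average argument.

```python
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels: class 0 is large, class 1 is medium, class 2 is tiny.
y_true = [0] * 6 + [1] * 3 + [2] * 1
# Class 0 is always right, class 1 is right 2 times out of 3,
# and the single class-2 example is missed.
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

per_class = recall_score(y_true, y_pred, average=None)
macro = recall_score(y_true, y_pred, average="macro")
weighted = recall_score(y_true, y_pred, average="weighted")
micro = recall_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print("per-class recall:", per_class.round(3))
print(f"macro:    {macro:.3f}")     # every class counts equally
print(f"weighted: {weighted:.3f}")  # classes weighted by support
print(f"micro:    {micro:.3f}")     # all decisions pooled together
print(f"accuracy: {accuracy:.3f}")
```

The macro average drops sharply because the missed tiny class counts as much as the large one, while the weighted and micro averages stay close to accuracy and barely register the failure.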
Choosing the right metric
The metric is not a technicality — it is a statement of values. "We must not miss a real case" means recall. "We must not cry wolf" means precision. Pick the metric that matches the cost of being wrong in your world.
Common misconceptions
- "High accuracy means a good classifier." Only when classes are balanced and both errors cost the same. Otherwise it can be a mirage.
- "Precision and recall are basically the same thing." They answer opposite questions — trustworthiness of positive predictions versus coverage of real positives — and improving one often worsens the other.
- "F1 is always the best single metric." F1 assumes FP and FN matter equally. When they do not, F1 quietly optimizes the wrong thing.
- "The 0.5 threshold is fixed." It is an arbitrary default. Tuning the threshold to your costs is one of the cheapest, highest-impact moves in applied classification.
- "Recall of 100% means the model is great." Predicting positive for everything achieves it — with disastrous precision. Always read the pair.
Real-world applications
A fraud team tunes for recall (catch the fraud) while capping false positives so they do not freeze honest customers. An email provider tunes spam filters for precision (never lose a real email). A medical screen maximizes recall and lets a cheap confirmatory test clean up the false alarms. In each case the confusion matrix — not accuracy — is the artifact the team actually argues over.
Your turn
A LogisticRegression is fit on an imbalanced dataset, and
y_test / y_pred are provided (positive class is 1).
- Compute the confusion matrix into cm with confusion_matrix(y_test, y_pred).
- Unpack it into tn, fp, fn, tp (use cm.ravel()).
- Compute precision and recall for the positive class from those four counts using the formulas (not by re-calling sklearn): precision = tp / (tp + fp) and recall = tp / (tp + fn).
The tests confirm your hand-computed precision and recall match
scikit-learn's precision_score and recall_score.
Check your understanding
A model achieves 97% accuracy on a dataset where 96% of cases are negative. What is the most important next check?
- Ship it; 97% is excellent
- Increase the training set size
- Look at per-class metrics (precision/recall on the positive class), because high accuracy may just reflect the dominant negative class
- Switch to a regression model
Precision for the positive class is defined as:
- TP / (TP + FN)
- (TP + TN) / total
- TP / (TP + FP): of everything predicted positive, the fraction that truly was
- TN / (TN + FP)
A cancer screening test should be tuned primarily for high recall. Why?
- Because false alarms are the main danger
- Because a false negative (missing a real case) is the most costly error, and recall measures the fraction of real positives caught
- Because recall ignores the positive class
- Because recall is always higher than precision
You raise a classifier's decision threshold from 0.5 to 0.8. What typically happens?
- Both precision and recall rise
- Both precision and recall fall
- Precision tends to rise and recall tends to fall, because the model now demands more confidence before predicting positive
- Accuracy is guaranteed to improve
Why does the F1 score use the harmonic mean of precision and recall rather than a simple average?
- The harmonic mean is easier to compute
- The harmonic mean stays low unless both precision and recall are reasonably high, so it punishes models that sacrifice one for the other
- It makes F1 always equal to accuracy
- It ignores false negatives
A model reaches 100% recall on the positive class. By itself, what does this guarantee?
- The model is excellent
- Precision is also 100%
- Nothing about overall quality: predicting "positive" for every case also achieves 100% recall, while destroying precision
- Accuracy is 100%