Dataslope

Classification Metrics — Beyond Accuracy

Accuracy is the metric everyone reaches for first and the one that misleads most often. The confusion matrix, precision, recall, and F1 tell the real story.

When a model predicts categories, the obvious question is "how often is it right?", which is its accuracy. Accuracy is intuitive, easy to compute, and genuinely useful sometimes. It is also behind a great many bad models shipped with confidence. This chapter explains why, and gives you the tools that do not lie: the confusion matrix, precision, recall, and F1.

The accuracy trap

Imagine a disease that affects 2% of patients. Build a "model" that ignores its inputs and predicts "healthy" for everyone. Its accuracy? 98%. It is also completely worthless — it never catches a single sick patient. Let us watch this happen for real.
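A minimal sketch of this trap, assuming scikit-learn is installed. The dataset is synthetic (generated with make_classification at roughly 2% positives) and the variable names are illustrative. DummyClassifier with strategy="most_frequent" stands in for the "model" that ignores its inputs.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, assumed data: roughly 2% positives, mirroring the disease example.
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A "model" that ignores its inputs and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
caught = int(((y_pred == 1) & (y_test == 1)).sum())

print(f"accuracy: {accuracy:.3f}")
print(f"recall on positives: {recall:.3f}")
print(f"positive cases caught: {caught} of {int(y_test.sum())}")
```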

Over 90% accuracy, and it caught zero positive cases. The accuracy is high only because the negative class is huge and easy. On any problem where the rare class is the one you care about — fraud, disease, defaults, defects — accuracy alone is dangerously misleading.

Accuracy hides what matters on imbalanced data

On imbalanced problems, a useless majority-class predictor can post a gorgeous accuracy while completely failing at the actual task. Whenever the classes are imbalanced, treat raw accuracy with deep suspicion and look at per-class metrics instead.

The confusion matrix: where every metric is born

To talk precisely about classifier mistakes, we first need names for the four things that can happen on a binary problem. Compare each prediction to the truth:

  • True Positive (TP) — predicted positive, and it was. A correct catch.
  • False Positive (FP) — predicted positive, but it was negative. A false alarm. (Statisticians call this a Type I error.)
  • False Negative (FN) — predicted negative, but it was positive. A missed case. (A Type II error.)
  • True Negative (TN) — predicted negative, and it was. A correct pass.

The confusion matrix simply counts these four outcomes. scikit-learn lays it out with actual classes as rows and predicted classes as columns.

Read the corners carefully

With the default labels [0, 1], the bottom-right cell is TP (actual 1, predicted 1) and the top-left is TN. The off-diagonal cells are the mistakes: top-right is FP, bottom-left is FN. The whole point of the matrix is that it separates the two kinds of mistake, which a single accuracy number blends together.
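The layout described above can be checked on a tiny example. This is a sketch with hand-made labels (the values are illustrative, not from a real model), using scikit-learn's confusion_matrix.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels: 6 actual negatives followed by 4 actual positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4 2]
#  [1 3]]

# Flattening row by row gives TN, FP, FN, TP for labels [0, 1].
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Here the two false alarms (top-right) and the one missed case (bottom-left) are counted separately, even though accuracy would report them together as three errors out of ten.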

Precision: when false alarms are expensive

Precision answers: of all the cases the model flagged as positive, what fraction really were?

$$\text{precision} = \frac{TP}{TP + FP}$$
  • What it measures. The trustworthiness of a positive prediction. High precision means "when this model says positive, believe it."
  • What it does not measure. How many positives you missed. A model can have perfect precision by making a single, extremely confident positive prediction that turns out to be correct, while ignoring every other real case.
  • When it matters most. When false positives are costly. A spam filter that sends a real, important email to the junk folder (FP) has failed badly, so spam filters prize precision. The same holds for a model that flags customers for an expensive intervention.
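To make the blind spot concrete, here is a small sketch with hypothetical counts. The precision helper is illustrative, and returning 0.0 when nothing is flagged is an assumed convention.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that were truly positive."""
    flagged = tp + fp
    # Assumed convention: report 0.0 when the model flags nothing.
    return tp / flagged if flagged else 0.0

# Hypothetical setting: the data contains 100 real positives.
# Cautious model: flags a single case and gets it right (it misses the other 99).
cautious = precision(tp=1, fp=0)
# Broader model: flags 120 cases, 80 of them correct.
broad = precision(tp=80, fp=40)

print(cautious)         # 1.0
print(round(broad, 3))  # 0.667
```

The cautious model wins on precision even though it finds one positive where the broader model finds 80. Precision alone cannot tell them apart in the way that matters.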

Recall: when misses are expensive

Recall (also called sensitivity) answers: of all the cases that were actually positive, what fraction did the model catch?

$$\text{recall} = \frac{TP}{TP + FN}$$
  • What it measures. Coverage of the real positives. High recall means "few positives slip through."
  • What it does not measure. How many false alarms you raised getting there. A model can reach perfect recall by predicting positive for everyone — catching all real cases and drowning in false positives.
  • When it matters most. When false negatives are costly. A cancer screen that misses a real tumor (FN) can cost a life, so screening prizes recall even at the price of more false alarms (which a follow-up test can rule out).
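The "predict positive for everyone" case can be worked through directly. The labels below are hypothetical (5 real positives among 100 cases), and the counts are computed by hand rather than with a library.

```python
# Hypothetical labels: 5 real positives among 100 cases.
y_true = [1] * 5 + [0] * 95
# A "model" that predicts positive for every case.
y_pred = [1] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"recall: {recall:.2f}")        # recall: 1.00
print(f"precision: {precision:.2f}")  # precision: 0.05
```

Recall is perfect because nothing was missed, but only 5 of the 100 alarms are real.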

Precision and recall pull in opposite directions

You can almost always raise one by sacrificing the other. Flag more cases as positive and recall rises while precision falls; flag fewer and the reverse happens. A single number cannot capture a classifier — you need at least this pair, chosen around which mistake hurts more.

The threshold is the dial

Many classifiers do not output a class directly. They output a probability (or a score), and a threshold (0.5 by default for probabilities) turns it into a class. Moving that threshold is exactly how you trade precision against recall.
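A sketch of the threshold as a dial, assuming scikit-learn and a synthetic imbalanced dataset. The LogisticRegression model, the threshold values, and the variable names are illustrative choices. Predicted probabilities come from predict_proba, and each threshold is applied by hand.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, assumed data with roughly 10% positives.
X, y = make_classification(
    n_samples=5_000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Probability of the positive class (column 1) for each test case.
proba = model.predict_proba(X_test)[:, 1]

thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
recalls = []
print("threshold  precision  recall")
for t in thresholds:
    # Predict positive only when the probability reaches the threshold.
    y_pred = (proba >= t).astype(int)
    p = precision_score(y_test, y_pred, zero_division=0)
    r = recall_score(y_test, y_pred)
    recalls.append(r)
    print(f"{t:9.1f}  {p:9.3f}  {r:6.3f}")
```

Raising the threshold can only shrink the set of cases flagged as positive, so recall never increases from one row to the next. Precision usually rises, but that is not guaranteed at every step.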

Read the table top to bottom: as the threshold rises, the model demands more confidence before saying "positive," so recall falls and precision generally climbs. There is no universally correct threshold; it depends on whether a false alarm or a missed case is worse for your problem. The default 0.5 is a convention, not a law.

F1: one number when you must have one

Sometimes you genuinely need a single score — to rank models, to early-stop a search. Averaging precision and recall with a plain mean is too forgiving (a model with precision 1.0 and recall 0.0 would score 0.5). The F1 score uses the harmonic mean, which stays low unless both are decent.

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
  • What it measures. A balance of precision and recall that punishes lopsidedness. F1 is high only when neither is being sacrificed.
  • What it does not measure. The true negatives, and your actual relative cost of FP versus FN — F1 weights precision and recall equally, which may not match your real priorities.
  • When it can mislead. If false positives and false negatives have very different costs, the equal weighting in F1 is wrong for you, and you should optimize precision or recall directly (or a weighted F-beta score).
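The difference between the plain mean and the harmonic mean, and the effect of the F-beta weighting, can be sketched in a few lines. The f_beta helper is illustrative and uses the standard formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). Returning 0.0 when both inputs are zero is an assumed convention.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives F1, beta>1 favors recall, beta<1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # Assumed convention to avoid dividing by zero.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# The lopsided model from the text: precision 1.0, recall 0.0.
plain_mean = (1.0 + 0.0) / 2
f1_lopsided = f_beta(1.0, 0.0)
print(plain_mean)   # 0.5
print(f1_lopsided)  # 0.0

# A hypothetical model with high precision but low recall.
p, r = 0.9, 0.3
f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)      # weights recall more heavily
f_half = f_beta(p, r, beta=0.5)  # weights precision more heavily
print(round(f1, 3), round(f2, 3), round(f_half, 3))  # 0.45 0.346 0.643
```

Because this model's weakness is recall, the recall-heavy F2 scores it lower than F1, while the precision-heavy F0.5 scores it higher.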

classification_report: the whole picture at once

In practice you rarely compute these by hand. classification_report prints precision, recall, and F1 for every class, plus support (how many true examples of each class there were).
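A minimal sketch of the call, using hand-made labels rather than a fitted model. The class names passed to target_names are illustrative, and output_dict=True is used only to pull individual numbers out programmatically.

```python
from sklearn.metrics import classification_report

# Hand-made, illustrative labels: 90 negatives and 10 positives.
y_true = [0] * 90 + [1] * 10
# Predictions: 85 true negatives, 5 false positives,
# then 6 true positives and 4 false negatives.
y_pred = [0] * 85 + [1] * 5 + [1] * 6 + [0] * 4

names = ["negative", "positive"]
print(classification_report(y_true, y_pred, target_names=names))

# The same report as a nested dict, convenient for logging or tests.
report = classification_report(
    y_true, y_pred, target_names=names, output_dict=True
)
positive_row = report["positive"]
print(positive_row)
```

On these labels the overall accuracy is 0.91, while the positive class has precision 6/11 and recall 6/10.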

Look at the positive row specifically — its precision and recall are the honest report card on the rare, important class, the one accuracy was hiding. The macro avg averages each metric across classes treating them equally (good for imbalance); the weighted avg weights by support (closer to overall accuracy).

Multi-class: the same ideas, averaged

For more than two classes, precision, recall, and F1 are computed per class (one-vs-rest) and then averaged. macro treats every class equally; weighted accounts for class sizes; micro pools all decisions together. On imbalanced multi-class problems, prefer macro so a tiny class is not ignored.
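The three averaging modes can be compared on a small hand-made three-class example. The labels below are hypothetical and chosen so that the smallest class is the one the predictions handle badly. Recall is used here, but precision_score and f1_score accept the same average argument.

```python
from sklearn.metrics import recall_score

# Hypothetical labels: class sizes 50, 40, and 10.
y_true = [0] * 50 + [1] * 40 + [2] * 10
# Classes 0 and 1 are predicted perfectly; only 2 of the 10
# class-2 cases are caught, and the other 8 are predicted as class 0.
y_pred = [0] * 50 + [1] * 40 + [2] * 2 + [0] * 8

per_class = recall_score(y_true, y_pred, average=None)
macro = recall_score(y_true, y_pred, average="macro")
weighted = recall_score(y_true, y_pred, average="weighted")
micro = recall_score(y_true, y_pred, average="micro")

print("per class:", per_class)     # recalls of 1.0, 1.0, and 0.2
print(f"macro:    {macro:.3f}")    # macro:    0.733
print(f"weighted: {weighted:.3f}") # weighted: 0.920
print(f"micro:    {micro:.3f}")    # micro:    0.920
```

The macro average drops to about 0.73 because the weak small class counts as much as the large ones. The weighted and micro averages stay at 0.92 and largely hide it.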

Choosing the right metric

The metric is not a technicality — it is a statement of values. "We must not miss a real case" means recall. "We must not cry wolf" means precision. Pick the metric that matches the cost of being wrong in your world.

Common misconceptions

  • "High accuracy means a good classifier." Only when classes are balanced and both errors cost the same. Otherwise it can be a mirage.
  • "Precision and recall are basically the same thing." They answer opposite questions — trustworthiness of positive predictions versus coverage of real positives — and improving one often worsens the other.
  • "F1 is always the best single metric." F1 assumes FP and FN matter equally. When they do not, F1 quietly optimizes the wrong thing.
  • "The 0.5 threshold is fixed." It is an arbitrary default. Tuning the threshold to your costs is one of the cheapest, highest-impact moves in applied classification.
  • "Recall of 100% means the model is great." Predicting positive for everything achieves it — with disastrous precision. Always read the pair.

Real-world applications

A fraud team tunes for recall (catch the fraud) while capping false positives so they do not freeze honest customers. An email provider tunes spam filters for precision (never lose a real email). A medical screen maximizes recall and lets a cheap confirmatory test clean up the false alarms. In each case the confusion matrix — not accuracy — is the artifact the team actually argues over.

Your turn

Challenge
Read a classifier's confusion matrix

A LogisticRegression is fit on an imbalanced dataset, and y_test / y_pred are provided (positive class is 1).

  1. Compute the confusion matrix into cm with confusion_matrix(y_test, y_pred).
  2. Unpack it into tn, fp, fn, tp (use cm.ravel()).
  3. Compute precision and recall for the positive class from those four counts using the formulas (not by re-calling sklearn) — precision = tp / (tp + fp) and recall = tp / (tp + fn).

The tests confirm your hand-computed precision and recall match scikit-learn's precision_score and recall_score.

Check your understanding

Question (select one)

A model achieves 97% accuracy on a dataset where 96% of cases are negative. What is the most important next check?

Ship it; 97% is excellent

Increase the training set size

Look at per-class metrics (precision/recall on the positive class), because high accuracy may just reflect the dominant negative class

Switch to a regression model

Question (select one)

Precision for the positive class is defined as:

TP / (TP + FN)

(TP + TN) / total

TP / (TP + FP) — of everything predicted positive, the fraction that truly was

TN / (TN + FP)

Question (select one)

A cancer screening test should be tuned primarily for high recall. Why?

Because false alarms are the main danger

Because a false negative — missing a real case — is the most costly error, and recall measures the fraction of real positives caught

Because recall ignores the positive class

Because recall is always higher than precision

Question (select one)

You raise a classifier's decision threshold from 0.5 to 0.8. What typically happens?

Both precision and recall rise

Both precision and recall fall

Precision tends to rise and recall tends to fall, because the model now demands more confidence before predicting positive

Accuracy is guaranteed to improve

Question (select one)

Why does the F1 score use the harmonic mean of precision and recall rather than a simple average?

The harmonic mean is easier to compute

The harmonic mean stays low unless both precision and recall are reasonably high, so it punishes models that sacrifice one for the other

It makes F1 always equal to accuracy

It ignores false negatives

Question (select one)

A model reaches 100% recall on the positive class. By itself, what does this guarantee?

The model is excellent

Precision is also 100%

Nothing about overall quality — predicting "positive" for every case also achieves 100% recall, while destroying precision

Accuracy is 100%
