Common Pitfalls and Misconceptions

A field guide to the mistakes that quietly sink machine learning projects — what the trap is, why it fools smart people, and exactly how to avoid it.

Most machine learning failures are not exotic. They are the same handful of mistakes, made over and over, by people who absolutely know better in the abstract but slip in the moment. The mistakes share a personality: they produce results that look great, which is precisely why they are so dangerous. A model that crashes is annoying but honest. A model that scores 99% because of a subtle leak is a trap with a bow on it.

This page is a catalog of those traps. For each one: what the trap is, why it fools people, and the fix. Read it once now, then come back to it whenever a result looks too good to be true — because that feeling is usually right.

(a) Data leakage

The trap. Information that would not be available at prediction time sneaks into training, so the model "knows" things it could never really know. The two most common forms are preprocessing before the split and target leakage.

Why it fools people. Leakage produces spectacular validation and even test scores — that is its signature. Everything looks like a triumph. The model only fails once it meets genuinely new data, often in production, where the leaked information is gone. By then the damage is done.

The most common version is easy to commit by accident: you scale or impute using statistics computed from the whole dataset, then split. The scaler applied to the training rows has now already "seen" the mean and standard deviation of the test rows, so test-set information shapes the training data before the model is ever fitted.
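A minimal sketch of the two orderings, using synthetic data invented purely for illustration. It fits one scaler on every row and another on the training rows only, then prints the statistics each one learned. The variable names are assumptions, not part of any starter code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Leaky order: the scaler is fitted on every row, test rows included.
leaky_scaler = StandardScaler().fit(X)

# Safe order: split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
safe_scaler = StandardScaler().fit(X_train)
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)  # reuse the training statistics

# The two scalers learned different means: the leaky one absorbed the test rows.
print("mean from all rows:     ", leaky_scaler.mean_)
print("mean from training rows:", safe_scaler.mean_)
```

With plain scaling the numerical gap is often small, which is part of why the habit goes unnoticed. The same mistake with imputation or feature selection can be far more damaging.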

Target leakage is the nastier cousin: a feature secretly encodes the answer. A model predicting whether a patient has a disease, fed a column like was_prescribed_the_disease_drug, will look perfect because it is reading the diagnosis, not predicting it. The feature is a consequence of the target, not a legitimate signal available beforehand.
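To see how a single leaky column distorts a score, here is a hedged toy simulation. The data-generating process (a weakly informative risk_score, a 2% record-keeping error rate) is an assumption made up for this sketch; only the column name comes from the example above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

# One honest feature with only a weak link to the disease label.
risk_score = rng.normal(size=n)
has_disease = (risk_score + rng.normal(scale=2.0, size=n) > 0).astype(int)

# Leaky feature: the drug is prescribed after diagnosis, so it mirrors the label.
# About 2% of entries are flipped to mimic record-keeping noise.
flip = rng.random(n) < 0.02
was_prescribed_the_disease_drug = np.where(flip, 1 - has_disease, has_disease)

X_honest = risk_score.reshape(-1, 1)
X_leaky = np.column_stack([risk_score, was_prescribed_the_disease_drug])

model = LogisticRegression(max_iter=1000)
honest_acc = cross_val_score(model, X_honest, has_disease, cv=5).mean()
leaky_acc = cross_val_score(model, X_leaky, has_disease, cv=5).mean()

print(f"without the leaky column: {honest_acc:.3f}")
print(f"with the leaky column:    {leaky_acc:.3f}")
```

Note that cross-validation does not protect you here. The leak lives in the feature itself, so every fold sees it, and only a question about when the value becomes available will catch it.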

The fix. Split first, and fit every learned transformation (scaling, imputing, encoding, feature selection) on the training set only — which a Pipeline makes automatic. For target leakage, ask of every feature: would I actually have this value, with this meaning, at the moment I need to predict? If not, drop it.
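The sketch below shows why the Pipeline matters, using feature selection on pure noise as the learned transformation. The data is random by construction, so any score well above chance is an artifact of the procedure. The step names "select" and "model" are arbitrary choices for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: the features carry no information about the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

# Leaky: pick the 20 "best" features using every row, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_cv = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: selection sits inside the pipeline, so it is refitted on each
# training fold and never sees that fold's validation rows.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_cv = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection before CV: {leaky_cv:.3f}")
print(f"selection inside CV: {honest_cv:.3f}")
```

The honest pipeline should hover around coin-flip accuracy, which is the correct answer for noise. The leaky version typically reports something far more flattering.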

Leakage is the deadliest pitfall because it looks like success

A crashing model gets fixed. A leaking model gets deployed, because its validation numbers are gorgeous. The discipline of splitting first and wrapping preprocessing in a `Pipeline` exists almost entirely to make leakage hard to commit by accident.

(b) Evaluating on the training data

The trap. Reporting how well the model does on the very data it was trained on, and treating that as its performance.

Why it fools people. The number is always flattering, sometimes a perfect 1.000, and a perfect score feels like the goal. But a flexible model can memorize its training data, so the training score measures memory, not the ability to generalize. It answers a question nobody cares about: "how well does the model recall what it has already seen?"

The fix. Always report performance on held-out data: a test set, or cross-validated scores. The bigger the gap between training and held-out scores, the more the model has overfit. Treat a perfect training score as a red flag, never a trophy.
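A small sketch of the training-versus-held-out gap, assuming a synthetic dataset with deliberately noisy labels and an unpruned decision tree as the flexible model. Neither choice comes from the text above; they are simply a convenient way to make memorization visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 assigns random labels to about 10% of the rows, so a perfect
# score cannot be earned honestly on this data.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unpruned tree is flexible enough to memorize its training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"training accuracy: {train_acc:.3f}")
print(f"held-out accuracy: {test_acc:.3f}")
print(f"gap:               {train_acc - test_acc:.3f}")
```

The training score says the tree memorized everything, noise included. Only the held-out score says anything about new data.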

(c) Ignoring class imbalance

The trap. On a dataset where one class is rare, you celebrate a high accuracy while the model quietly ignores the rare class entirely — which is often the only class you care about.

Why it fools people. Accuracy is intuitive and usually trustworthy, so people reach for it reflexively. But when 99% of cases are class 0, a model that always predicts class 0 — learning nothing — scores 99% accuracy. The number looks superb and the model is useless.
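The always-predict-class-0 model can be reproduced in a few lines. This sketch assumes a made-up dataset of 990 negatives and 10 positives and uses scikit-learn's DummyClassifier as the "learns nothing" baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 990 negatives and 10 positives: a 99% / 1% class split.
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant to a majority-class predictor

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

accuracy = accuracy_score(y, pred)
recall_rare = recall_score(y, pred, pos_label=1)
print(f"accuracy:          {accuracy:.2f}")     # 0.99
print(f"recall on class 1: {recall_rare:.2f}")  # 0.00
print(confusion_matrix(y, pred))  # rows: true class, columns: predicted class
```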

Such a model pairs 99% accuracy with a 0% catch rate on the cases that matter. The fix: on imbalanced problems, look past accuracy to precision, recall, the confusion matrix, and ROC-AUC (from the classification chapters). Pass stratify=y to train_test_split so the rare class is represented in both halves, and consider class weights (for example, class_weight="balanced" on estimators that support it) or resampling so the model is pushed to care about the minority.

Accuracy is the wrong default for imbalanced data

The rarer the class you care about, the more accuracy flatters a model that ignores it. If a trivial majority-class predictor would already score well, accuracy cannot tell skill from laziness. Reach for recall, precision, and the confusion matrix instead.

(d) Chasing a single metric

The trap. Optimizing one number to the exclusion of everything else, and losing sight of what the number leaves out.

Why it fools people. A single metric is comforting: it makes "better" unambiguous and progress measurable. But every metric is a lossy summary. Maximize precision alone and you may tank recall, catching only the easy, obvious cases. Maximize accuracy and you may trample the rare class. Optimize a proxy hard enough and it stops reflecting the real goal. This is Goodhart's law: when a measure becomes a target, it ceases to be a good measure.

The fix. Choose your metric to match the real cost structure of the problem, and watch a small basket of metrics, not one. A missed fraud and a false fraud alert cost very different amounts; let that asymmetry — not convenience — pick your metric. When two metrics trade off (precision vs. recall), decide the balance you actually want before you optimize.

The metric must encode what you actually care about

Before optimizing anything, ask: what does a mistake cost, and are all mistakes equal? If a false negative is ten times worse than a false positive, accuracy — which treats them the same — is the wrong target. Pick the metric that mirrors real-world costs, then keep an eye on the others it ignores.
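One way to make the cost question concrete is to score models by total cost instead of accuracy. The sketch below assumes hypothetical costs (a false negative costs 10, a false positive costs 1) and two hand-built sets of predictions. The total_cost helper is illustrative, not a library function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical costs: a false negative is ten times worse than a false positive.
COST_FN = 10.0
COST_FP = 1.0

def total_cost(y_true, y_pred):
    """Sum the cost of every mistake using the asymmetric costs above."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return COST_FN * fn + COST_FP * fp

y_true = np.array([0] * 90 + [1] * 10)

# Model A is cautious: no false alarms, but it misses 5 of the 10 positives.
pred_a = np.array([0] * 90 + [1] * 5 + [0] * 5)
# Model B is trigger-happy: 8 false alarms, but it misses only 1 positive.
pred_b = np.array([1] * 8 + [0] * 82 + [1] * 9 + [0] * 1)

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, "accuracy:", accuracy_score(y_true, pred),
          "cost:", total_cost(y_true, pred))
```

Model A wins on accuracy (0.95 against 0.91), yet Model B costs far less (18 against 50) once misses are priced at ten times a false alarm. Which model is "better" depends entirely on the costs you wrote down first.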

(e) Not setting random seeds

The trap. Leaving the random pieces of your workflow — the split, a forest's bootstrap sampling, a KMeans initialization — unseeded, so every run gives different numbers.

Why it fools people. It rarely causes an obvious error, just a slow erosion of trust. You report 0.92, a colleague reruns and gets 0.89, and nobody can tell whether a change helped or you simply caught a luckier shuffle. Comparisons become meaningless because two runs differ for reasons that have nothing to do with the thing you changed.

The fix. Set random_state on every component that accepts one: the split, the model, the search. Where code draws from NumPy's global random generator, seed that too with np.random.seed. Fixed seeds are the bedrock of honest comparison, and they let anyone rerun your work and get the same numbers, including future you.
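A minimal sketch of seeding every random component from one recorded value. The run_experiment helper and the SEED constant are names invented for this example, and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary value; what matters is that it is fixed and recorded

def run_experiment(seed):
    """Split, fit, and score with every random component tied to one seed."""
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return model.fit(X_train, y_train).score(X_test, y_test)

first = run_experiment(SEED)
second = run_experiment(SEED)
print(first, second, first == second)
```

A fixed seed makes a run repeatable. It does not make the score representative, so it is still worth checking how much the number moves across a handful of different seeds.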

(f) Tuning against the test set / repeated peeking

The trap. Using the test score to guide your choices — picking a model, a threshold, a hyperparameter, or just deciding when to stop — and then reporting that same test score as if it were untouched.

Why it fools people. Each individual peek feels harmless: "I just checked, and version B was a bit better." But every decision informed by the test set leaks a little of it into your model, and across dozens of peeks you have effectively trained on the test set through your own choices. The final number becomes an optimistic fiction, and you cannot feel it happening.

The fix. Make all decisions with cross-validation or a separate validation set. The test set is unlocked exactly once, at the very end, after everything is frozen. If you truly need another round of experiments after looking, you need fresh data — the spent test set cannot be reused honestly. This is the same discipline as the practical-workflow chapter, restated because it is violated so often.
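The discipline can be sketched as follows, assuming a synthetic dataset and a single hyperparameter to tune. The names X_dev and final_test_score are illustrative. GridSearchCV is one reasonable way to confine decisions to cross-validation, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=15, random_state=0)

# Lock the test set away before any modelling decision is made.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Every choice (here, the regularization strength C) is made by
# cross-validation on the development data only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_dev, y_dev)
print("chosen C:", search.best_params_["C"])
print("cross-validated score:", round(search.best_score_, 3))

# Only after everything is frozen: one look at the test set.
final_test_score = search.score(X_test, y_test)
print("test score (reported once):", round(final_test_score, 3))
```

If the final line tempts you to go back and adjust the grid, the test set has just become a validation set, and the number it reports is no longer an honest estimate.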

Repeated peeking is slow-motion leakage

You do not need to train on the test set to ruin it — you only need to keep consulting it. Every model you keep because it beat the others on the test set has been selected using the test set. Decide with validation data; spend the test set once.

(g) Extrapolating beyond the training range

The trap. Trusting a model's predictions for inputs far outside the range of anything it was trained on.

Why it fools people. A model that is accurate within its training range feels trustworthy everywhere — but it has only ever seen a slice of reality. Outside that slice, it is guessing, and different model types guess in wildly different (and sometimes absurd) ways. A linear model marches its straight line off to infinity; a tree-based model goes flat, predicting a constant, because it can never output a value beyond what it saw in training.
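Here is a hedged toy comparison of the two behaviours. It assumes a made-up relationship, y = 2x plus a little noise, with training inputs between 0 and 10, and then asks both models about x = 100, where the true value would be 200.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Assumed toy relationship: y = 2x plus a little noise, with x between 0 and 10.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2 * X_train.ravel() + rng.normal(scale=0.5, size=200)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# x = 100 is far outside the training range; the true value would be 200.
X_far = np.array([[100.0]])
linear_pred = linear.predict(X_far)[0]
forest_pred = forest.predict(X_far)[0]

print(f"largest target seen in training: {y_train.max():.1f}")
print(f"linear model at x=100:           {linear_pred:.1f}")
print(f"random forest at x=100:          {forest_pred:.1f}")
```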

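Suppose, for example, that both models are trained on inputs between 0 and 10 and then asked to predict at x = 100, where the true value is 200. The forest cannot predict anything near 200: it never saw a target that large, and because each of its predictions is an average of training targets, it cannot output one. The linear model nails this particular case only because the true relationship really is a line, which you would not know in advance. The fix: know your training range, be deeply skeptical of predictions outside it, and never assume a pattern learned in one region holds in another.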

(h) Confusing correlation with causation

The trap. Concluding that because a feature predicts the target (or is "important" to the model), changing that feature would change the outcome.

Why it fools people. Predictive power feels like understanding, and a high feature importance feels like a lever. But a feature can predict an outcome because it is a proxy for the real cause, or a symptom of it, or because both share a hidden common cause. The model exploits the association happily; it says nothing about what would happen if you intervened.

A model might find ice-cream sales a strong predictor of drownings. Banning ice cream will not save a single swimmer — hot weather drives both. The fix: keep prediction and causation in separate mental boxes. A model answers "given what I observe, what is likely?" To answer "what should we change?" you need a controlled experiment or careful causal reasoning, not a feature-importance ranking. (The interpretation chapter drills this home.)
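The ice-cream story can be simulated in a few lines. In this sketch the world is built by assumption so that temperature drives both series and neither affects the other; all coefficients are invented. It then checks what remains of the association once temperature is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 365

# Assumed toy world: temperature drives both quantities; neither affects the other.
temperature = rng.uniform(0, 35, size=n_days)
ice_cream_sales = 20 * temperature + rng.normal(scale=50, size=n_days)
drownings = 0.3 * temperature + rng.normal(scale=1.5, size=n_days)

# Observation: the two series are strongly correlated.
observed_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"correlation(ice cream, drownings): {observed_corr:.2f}")

def residual(values, driver):
    """Remove the linear effect of `driver` from `values`."""
    slope, intercept = np.polyfit(driver, values, deg=1)
    return values - (slope * driver + intercept)

# Control for the common cause: correlate what is left of each series
# after the temperature effect has been subtracted.
partial_corr = np.corrcoef(
    residual(ice_cream_sales, temperature), residual(drownings, temperature)
)[0, 1]
print(f"correlation after removing temperature: {partial_corr:.2f}")
```

The raw correlation is strong enough that a model would gladly use ice-cream sales as a predictor. Once the common cause is removed, almost nothing is left, which is the signature of a confounded association. In real data you rarely know the confounder in advance, which is why a controlled experiment remains the safer route.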

Prediction is not intervention

"This feature predicts the outcome" and "changing this feature changes the outcome" are different claims, and a model only ever supports the first. The most confident-sounding misuse of machine learning is reading a predictive model as a recipe for action. It is not.

(i) "More features / more data is always better"

The trap. Believing that throwing in every available feature, or simply piling on rows, must improve the model.

Why it fools people. "More information cannot hurt" sounds like common sense. But more features can absolutely hurt: irrelevant ones add noise the model can overfit to, correlated ones muddle interpretation, and a high-cardinality junk column (a row ID, say) can even mislead impurity-based importance measures. More features also mean more chances for one of them to be a leak. And more data helps most when it is relevant and representative: a million more rows of the same narrow situation will not teach a model about situations it still has never seen.

The fix: prefer features with a plausible connection to the target, and let evidence — cross-validation, importance, domain knowledge — decide what stays. Quality and relevance beat raw quantity. When you do want more data, seek data that covers new situations, not just more of the same.
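A quick sketch of irrelevant features hurting a model, assuming a synthetic dataset with five informative columns and a k-nearest-neighbours classifier. That model is chosen here because distance-based methods are especially sensitive to noise dimensions; other models degrade more gracefully.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Five genuinely informative features.
X_useful, y = make_classification(
    n_samples=300, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

# Append 200 columns of pure noise that have nothing to do with y.
rng = np.random.default_rng(0)
X_padded = np.hstack([X_useful, rng.normal(size=(300, 200))])

model = KNeighborsClassifier(n_neighbors=5)
score_useful = cross_val_score(model, X_useful, y, cv=5).mean()
score_padded = cross_val_score(model, X_padded, y, cv=5).mean()

print(f"5 informative features:  {score_useful:.3f}")
print(f"plus 200 noise features: {score_padded:.3f}")
```

The same rows, the same labels, and strictly more columns, yet the cross-validated score drops because the signal is drowned out by dimensions that carry none.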

The pattern behind every pitfall

Look back and you will see one theme. Almost every trap on this page is a way of fooling yourself about what your model has truly seen and truly learned — leaking the test set, scoring on memorized data, hiding behind a flattering metric, mistaking memory for skill. The cure is always the same posture: relentless honesty about what is held out, what is measured, and what the number really means.

Real-world applications

These are not academic worries; they are the post-mortems of real failed projects. A hospital model that aced validation and flopped in the clinic because a feature leaked the very diagnosis it was meant to predict. A fraud system praised for 99.9% accuracy that never caught a single fraud, because fraud was 0.1% of cases. A business "insight" that changing a predictive feature would move an outcome, acted on at great cost, when the feature was a mere symptom. Every one was preventable by a habit on this page. Learning to expect these traps is a large part of what separates a practitioner who can be trusted from one who merely gets good-looking numbers.

Your turn

The starter code contains a leaky evaluation: it scales the entire dataset with StandardScaler before cross-validating, so each fold's scaler has already seen its validation rows. Your job is to fix it the right way.

Do not pre-scale X. Instead build a Pipeline called pipe with two steps: a StandardScaler named "scaler" and a LogisticRegression(max_iter=5000) named "model".
Cross-validate pipe directly on the raw X, y with cv=5 and store the mean accuracy in correct_cv (use cross_val_score(...).mean()).

By scaling inside the pipeline, each fold is scaled using only its own training portion — no leakage. The hidden tests check that pipe is a proper two-step pipeline (so scaling happens inside cross-validation) and that correct_cv is a believable accuracy.

A dataset is wildly imbalanced: 950 of class 0 and 50 of class 1 (the rare class you care about). A DummyClassifier that always predicts the majority class is already fitted for you as model, and its predictions are in pred.

Compute the plain accuracy of pred against y and store it in acc (use accuracy_score).
Compute the recall on the rare class (class 1) and store it in rare_recall (use recall_score(y, pred, pos_label=1)).

You should find a high acc next to a rare_recall of 0.0 — the model looks great on accuracy yet catches none of the cases that matter. The hidden tests check that acc is high (above 0.9) while rare_recall is exactly 0.0, demonstrating the trap.

Check your understanding

Question (select one)

You scale your entire dataset with StandardScaler().fit_transform(X) and then split into train and test. Why is this data leakage?

Because scaling changes the labels y

Because the scaler's mean and standard deviation were computed using the test rows, so test-set information has entered the training transformation

Because StandardScaler should never be used before splitting any data

Because scaling makes the model train more slowly

Question (select one)

A model scores 1.000 accuracy on the data it was trained on. What is the right reaction?

Celebrate — the model is perfect and ready to deploy

Treat it as a red flag and check performance on held-out data, since a perfect training score usually signals memorization

Conclude the data must be corrupted

Retrain on even more of the same data to lock in the perfect score

Question (select one)

On a dataset that is 99% class 0, a model that always predicts class 0 reports 99% accuracy. What does this reveal?

The model is excellent and needs no further checks

The accuracy metric must be miscalculated

Accuracy can be high while the model ignores the rare class, so on imbalanced data you must also check recall, precision, and the confusion matrix

The rare class should simply be deleted from the dataset

Question (select one)

Why is leaving random_state unset across your split, model, and search a problem, even when nothing crashes?

It makes models slower because randomness adds computation

Every run can give different numbers, so you cannot tell whether a change helped or you simply got a luckier random draw

It forces the model to overfit the training data

It changes which metric scikit-learn uses

Question (select one)

You repeatedly check the test score while deciding which model and hyperparameters to keep, then report that test score. What have you actually done?

Nothing wrong, since you never explicitly trained on the test set

Leaked the test set through your choices, so its reported score is now optimistic rather than an honest estimate of unseen performance

Made the test set more reliable by checking it often

Guaranteed the model will generalize, since the test score was high

Question (select one)

A random forest is trained on inputs x in the range 0 to 10. You ask it to predict at x = 100. What should you expect?

It will accurately extrapolate the trend out to x = 100

It will raise an error because the input is out of range

Its prediction will stay roughly flat at the level it learned near the edge of the training range, because a forest cannot output values beyond what it saw in training

It will predict exactly the true value, whatever the relationship

Question (select one)

A model finds that "number of umbrellas sold" strongly predicts "number of car accidents." A planner proposes banning umbrella sales to reduce accidents. What is the flaw?

The model used the wrong features and should be retrained

Predicting accidents is impossible, so the model must be wrong

Umbrella sales predict accidents only because rain drives both; the association is not causal, so banning umbrellas would not reduce accidents

The model needs more umbrella-related features to be accurate

Question (select one)

When is adding more features likely to hurt a model rather than help it?

Never — additional features can only add information and improve a model

Only when the features are perfectly correlated with the target

When the added features are irrelevant or noisy, giving a flexible model extra opportunities to overfit and obscuring the real signal

Adding features always helps as long as you have enough rows
