Model Interpretation and Feature Importance
A model that works is not enough — you usually need to know why it works. How to ask a model which features drove its predictions, and the traps in every answer it gives.
Suppose you have built a model that predicts, with 96% accuracy, which breast tumors are malignant. Wonderful — but a doctor about to act on that prediction will immediately ask a question your accuracy score cannot answer: why? Which measurements made the model say "malignant"? Would it have said the same thing if one value were a little different? A number on a held-out set tells you that the model is good. It does not tell you how the model thinks, and for most real decisions, that second question matters just as much.
This page is about interpretation — opening the box and asking a trained model which features it leaned on, and how much. You will learn the three most common ways to do this in scikit-learn, and, just as important, the ways each one can quietly mislead you.
Why interpretability matters
A perfectly accurate model you cannot explain is, in many settings, nearly useless. Interpretability is not a nicety bolted on at the end; it is often the difference between a model that ships and one that sits in a notebook.
- Trust. A clinician, a loan officer, or a judge will not — and often legally cannot — act on a prediction they cannot scrutinize. "The model said so" is not a reason anyone can stand behind.
- Debugging. When a model behaves strangely, importance scores are your flashlight. If a customer-churn model leans almost entirely on a customer_id column, you have found a bug: that column should carry no real signal, and the model has latched onto an artifact.
- Fairness. If a model's decisions hinge heavily on a feature that is a proxy for a protected attribute (a zip code standing in for race, say), you need to see that before it causes harm.
- Stakeholder buy-in. "Late payments and high credit utilization drive this risk score" is a sentence a business can understand, challenge, and act on. A wall of weights is not.
- Scientific insight. Sometimes the model is a means to an end: understanding which factors relate to an outcome is the actual goal, and the predictions are secondary.
Interpretation answers a different question than evaluation
Evaluation asks "is this model any good?" and is answered by metrics on held-out data. Interpretation asks "why does it predict what it predicts?" and is answered by importance scores and explanations. A model can score well and still be uninterpretable, or be perfectly interpretable and score poorly. You usually need both kinds of answer.
Coefficients as importance — for linear and logistic models
The most directly interpretable models are the linear ones. A LinearRegression or LogisticRegression assigns each feature a single number, a coefficient, and builds its prediction by multiplying each feature by its coefficient, adding the products up, and adding an intercept. (For logistic regression that sum is the log-odds of the positive class, which is then squashed into a probability.) A coefficient's sign tells you the direction of the relationship; its magnitude hints at how strongly that feature pushes the prediction.
This is the appeal of linear models: their parameters are an explanation, for free. Let us read the coefficients of a logistic regression on the breast-cancer data, which ships with named features.
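A minimal sketch of reading those coefficients. The pipeline, max_iter=1000, and the choice to print the top five are assumptions made for illustration, not necessarily the setup this page originally used. Features are standardized first so that every coefficient is expressed per one standard deviation of its feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target          # in this dataset: 0 = malignant, 1 = benign
feature_names = data.feature_names

# Standardize, then fit. max_iter is raised only to make convergence comfortable.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name.
coefs = model.named_steps["logisticregression"].coef_[0]

# Print the five coefficients with the largest absolute value.
order = np.argsort(np.abs(coefs))[::-1]
for i in order[:5]:
    direction = "benign" if coefs[i] > 0 else "malignant"
    print(f"{feature_names[i]:<25} {coefs[i]:+.2f}  (pushes toward {direction})")
```

Because class 1 is "benign" in this dataset, a negative coefficient pushes the prediction toward "malignant".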
Each coefficient is a small story: this measurement, all else equal, pushes the prediction one way or the other by that much. For a stakeholder, that is gold.
But there is a catch, and it is a big one.
Coefficients depend on feature SCALE
A coefficient's size depends on the units of its feature. A feature measured in millimeters gets a coefficient about a thousand times smaller than the same feature in meters, because its values are a thousand times larger, yet nothing about the model's behavior has changed. You cannot compare raw coefficients across features on different scales. Either standardize the features first, so a one-unit change means "one standard deviation" for every feature, or do not rank features by raw coefficient magnitude at all.
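The unit effect is easy to check on made-up data. This sketch uses an unregularized LinearRegression and a synthetic length feature (both assumptions chosen to keep the arithmetic exact): the coefficient rescales by the unit conversion factor while the predictions stay the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
length_m = rng.uniform(1.0, 2.0, size=200)            # synthetic feature, in meters
y = 5.0 * length_m + rng.normal(0.0, 0.1, size=200)   # synthetic target

length_mm = length_m * 1000.0                         # the same feature, in millimeters

model_m = LinearRegression().fit(length_m.reshape(-1, 1), y)
model_mm = LinearRegression().fit(length_mm.reshape(-1, 1), y)

coef_m = model_m.coef_[0]
coef_mm = model_mm.coef_[0]
print(f"coefficient per meter:      {coef_m:.4f}")
print(f"coefficient per millimeter: {coef_mm:.7f}")
print(f"ratio (meters / millimeters): {coef_m / coef_mm:.1f}")

# The two models make the same predictions; only the units changed.
same_predictions = np.allclose(
    model_m.predict(length_m.reshape(-1, 1)),
    model_mm.predict(length_mm.reshape(-1, 1)),
)
print("identical predictions:", same_predictions)
```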
There is a second, subtler trap. When two features are correlated, linear models can split their shared influence between them almost arbitrarily — giving one a large coefficient and the other a tiny one, or even flipping a sign — without changing predictions at all. The model is indifferent to how it divides credit between redundant features, so reading a single coefficient as "this feature's importance" can badly mislead you.
Correlated features confuse coefficients
If two features carry overlapping information, a linear model may load all the weight onto one and starve the other, even reversing a coefficient's sign. The pair together matters, but neither coefficient alone tells you that. With correlated features, treat individual coefficients with suspicion. The breast-cancer data, for instance, has several near-duplicate "mean / worst / error" measurements of the same quantity.
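A small simulation can make this concrete. The data below is synthetic and the setup is an assumption for illustration: x2 is a near-duplicate of x1, and the target depends only on x1 with a true slope of 3. Refitting on bootstrap resamples shows the individual coefficients jumping around while their sum stays near 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # near-duplicate of x1
y = 3.0 * x1 + rng.normal(0.0, 0.5, size=n)  # only x1 truly matters
X = np.column_stack([x1, x2])

# Refit on bootstrap resamples and watch how the credit is divided.
sums = []
for trial in range(5):
    idx = rng.integers(0, n, size=n)         # sample rows with replacement
    coef = LinearRegression().fit(X[idx], y[idx]).coef_
    sums.append(coef.sum())
    print(f"x1: {coef[0]:+7.2f}   x2: {coef[1]:+7.2f}   sum: {coef.sum():+.2f}")
```

The pair's combined effect is stable; how it is split between the two columns is not.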
Tree and forest importances — feature_importances_
Tree-based models offer a different, built-in notion of importance. As a
decision tree grows, every split on a feature reduces impurity (it makes
the resulting groups purer in their labels). scikit-learn adds up how much
each feature reduced impurity across all the splits that used it, and
reports the totals as feature_importances_. A feature that was chosen for
many high-value splits scores high; one never split on scores zero.
This works for a single tree and, more reliably, for a whole random forest (averaging over many trees smooths out the noise of any single one).
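As a sketch, here is a forest fitted on the breast-cancer data with its largest impurity importances printed. The hyperparameters (n_estimators=200, random_state=0) are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# One impurity-based importance value per feature.
importances = forest.feature_importances_

# Print the five largest, then the total across all features.
order = np.argsort(importances)[::-1]
for i in order[:5]:
    print(f"{data.feature_names[i]:<25} {importances[i]:.3f}")
print(f"sum of all importances: {importances.sum():.3f}")
```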
Two conveniences jump out. Tree importances are always non-negative and sum to 1, so you can read them as "share of the model's decision-making attributed to this feature." And unlike coefficients, they are scale invariant — a tree splits on thresholds, so the units of a feature do not matter. That alone makes them friendlier than raw coefficients.
But impurity-based importance has well-known biases you must know about.
Impurity importance is biased toward high-cardinality features
Impurity-based feature_importances_ tends to inflate features with
many distinct values (high cardinality) — continuous numbers, or an ID-like
column — because such features offer more places to split and can reduce
impurity on the training data almost by accident. The notorious symptom:
add a column of pure random noise with many unique values, and a forest may
assign it a non-trivial importance. Treat impurity importance as a useful
first look, not the final word.
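The noise-column symptom can be reproduced in a few lines. This is a sketch under assumptions: the breast-cancer data, a Gaussian noise column named random_noise (a made-up name), and default forest settings apart from n_estimators and random_state. The exact score the noise receives will vary with the data and the seed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rng = np.random.default_rng(0)

# Append one column of pure noise; every value is distinct (high cardinality).
noise = rng.normal(size=(data.data.shape[0], 1))
X_noisy = np.hstack([data.data, noise])
names = list(data.feature_names) + ["random_noise"]

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_noisy, data.target)

importances = forest.feature_importances_
noise_importance = importances[-1]
rank = int((importances > noise_importance).sum()) + 1   # 1 = most important

print(f"random_noise importance: {noise_importance:.4f}")
print(f"random_noise rank: {rank} of {len(names)}")
```

The noise column carries no information about the target, so any importance above zero is an artifact of how the trees were grown on the training data.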
A second limitation: impurity importance is computed from how the tree was built on the training data, so it reflects training-set structure, not necessarily what helps on new data. For an importance measure tied to actual predictive performance, we need a different idea.
Permutation importance — model-agnostic and harder to fool
Permutation importance asks the most direct question imaginable: if I destroy the information in this feature, how much worse does the model get?
The procedure is beautifully simple. Take a trained model and a held-out set. Measure its score. Now randomly shuffle one feature's column — scrambling the link between that feature and the target while keeping its distribution identical — and measure the score again. If the score collapses, that feature was important; if the score barely moves, the model was not really relying on it. Repeat for every feature.
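The procedure above can be sketched by hand in a few lines. The helper name score_drop is hypothetical, and the model, dataset, and split are assumptions for illustration. Each column is shuffled only once here, so the numbers are noisy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0, stratify=data.target
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def score_drop(model, X, y, column, rng):
    """Baseline accuracy minus accuracy after shuffling one column (hypothetical helper)."""
    baseline = model.score(X, y)
    X_shuffled = X.copy()                                   # leave the caller's data intact
    X_shuffled[:, column] = rng.permutation(X_shuffled[:, column])
    return baseline - model.score(X_shuffled, y)

rng = np.random.default_rng(0)
drops = np.array(
    [score_drop(model, X_test, y_test, j, rng) for j in range(X_test.shape[1])]
)

# Features whose shuffling hurt the held-out score the most.
for j in np.argsort(drops)[::-1][:5]:
    print(f"{data.feature_names[j]:<25} {drops[j]:+.3f}")
```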
This has three big advantages. It is model-agnostic: it works on any fitted estimator, from a linear model to a forest to a pipeline, because it only needs the model's predictions (.predict()) and a score computed from them, never the model's internals. It measures importance against real predictive performance (the score you care about), not training-set impurity. And by evaluating on held-out data, it asks what helps on new data, not what the model memorized.
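scikit-learn provides this as sklearn.inspection.permutation_importance. The sketch below runs it on a scaled logistic-regression pipeline to show that the method does not care what kind of model it is given. The pipeline, the split, and printing only the top five are assumptions for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0, stratify=data.target
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Shuffle each feature 20 times on the held-out set and record the score drops.
result = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=0
)

for i in result.importances_mean.argsort()[::-1][:5]:
    print(
        f"{data.feature_names[i]:<25} "
        f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}"
    )
```

With no scoring argument, the function uses the estimator's own score method, which is accuracy for a classifier.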
Because the shuffle is random, permutation importance is itself a random
quantity — that is why we repeat it (n_repeats=20) and get a mean and a
standard deviation. A feature whose importance is well above its own noise
band is genuinely being used; one whose importance is a hair from zero
(within its standard deviation) probably is not.
Permutation importance on TEST data, not training data
Run permutation importance on held-out data whenever you can. On the training set, an overfit model may look like it depends heavily on features it merely memorized. On held-out data, you measure what actually helps the model generalize — which is almost always the question you care about.
Permutation importance is not flawless either. With strongly correlated features it can understate importance: if two columns carry the same information, shuffling just one leaves the other intact, so the model barely suffers and both look unimportant — even though the information they share is crucial. No single importance method is immune to correlated features; this is a recurring theme, not a quirk of one technique.
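A synthetic sketch of this effect, under assumptions chosen for clarity: one informative column called signal, one noise column, and a logistic regression. When an exact copy of signal is added, the model can fall back on the copy, so the permutation importance of the original column shrinks even though the information it carries is just as crucial.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = (signal > 0).astype(int)                 # the target depends only on signal

X_single = np.column_stack([signal, noise])
X_dup = np.column_stack([signal, signal, noise])   # second column is an exact copy

train, test = slice(0, 1000), slice(1000, None)

def signal_importance(X):
    """Permutation importance of column 0 on the held-out half (hypothetical helper)."""
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    result = permutation_importance(
        model, X[test], y[test], n_repeats=20, random_state=0
    )
    return result.importances_mean[0]

imp_alone = signal_importance(X_single)
imp_with_copy = signal_importance(X_dup)
print(f"importance of signal, no duplicate:   {imp_alone:.3f}")
print(f"importance of signal, with duplicate: {imp_with_copy:.3f}")
```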
Three lenses, three biases
You now have three ways to ask "which features matter," and each sees the model differently:
- Coefficients (linear models): a direction and a magnitude per feature, but only comparable after scaling, and shaky under correlation.
- Impurity importance (feature_importances_): free from trees, scale invariant, but biased toward high-cardinality features and tied to the training set.
- Permutation importance: model-agnostic and tied to real held-out performance, but can understate correlated features and costs extra compute.
When they agree, trust the ranking. When they disagree, that disagreement is itself a clue worth investigating.
A picture is worth a hundred printed numbers
Importance scores are far easier to read as a bar chart than as a column of numbers. Here we fit a forest and draw its top impurity importances with matplotlib — the kind of plot you will make constantly.
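A minimal plotting sketch, assuming the wine dataset, the top eight features, and an output file named feature_importances.png (all choices made for this example). The Agg backend line lets the script run without a display; in a notebook you would normally drop it and call plt.show() instead of saving.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove when working in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Pick the top_n features, largest importance first.
top_n = 8
order = np.argsort(forest.feature_importances_)[::-1][:top_n]
names = [data.feature_names[i] for i in order]
values = forest.feature_importances_[order]

fig, ax = plt.subplots(figsize=(7, 4))
# barh draws the first item at the bottom, so reverse to put the largest bar on top.
ax.barh(names[::-1], values[::-1])
ax.set_xlabel("Impurity-based importance")
ax.set_title("Top features of a random forest on the wine data")
fig.tight_layout()
fig.savefig("feature_importances.png")
```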
A horizontal bar chart, sorted, with the most important feature on top, is the standard way to present feature importance to a human. It turns a model into a one-glance story: these few measurements are doing most of the work.
Global vs. local explanations
Everything so far has been a global explanation: a single ranking that summarizes the model's behavior across the whole dataset. "Flavanoids and color intensity drive this wine classifier" is a global statement.
There is a second flavor: local explanations, which explain one specific prediction. "For this particular patient, the high worst-radius value is what pushed the model toward 'malignant'" is a local statement. Global tells you how the model behaves on average; local tells you why it made the call it made for a single case — which is often exactly what a person affected by the decision wants to know.
Global and local answer different questions
A global explanation summarizes the model overall (one ranking of features). A local explanation justifies a single prediction (why this row got this output). Specialized tools such as SHAP and LIME focus on local explanations and go beyond this course, but it is worth knowing the distinction: "important in general" and "decisive for this case" are not the same thing.
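For a linear model, a simple local explanation needs no extra tooling. This sketch is not SHAP or LIME: it is plain coefficient arithmetic on standardized features, and picking the first row of the breast-cancer data is an arbitrary assumption. Each feature's contribution to one row's log-odds is its coefficient times that row's standardized value, so a contribution is measured relative to an average tumor.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaler = StandardScaler().fit(data.data)
X_scaled = scaler.transform(data.data)
model = LogisticRegression(max_iter=1000).fit(X_scaled, data.target)

row = X_scaled[0]                        # one specific tumor
contributions = model.coef_[0] * row     # per-feature share of this row's log-odds
log_odds = contributions.sum() + model.intercept_[0]

# Class 1 is "benign" here, so negative contributions push toward "malignant".
for i in np.argsort(np.abs(contributions))[::-1][:5]:
    print(f"{data.feature_names[i]:<25} {contributions[i]:+.2f}")
print(f"total log-odds for this row: {log_odds:+.2f}")
```

The contributions plus the intercept add up to exactly the value the model's decision_function returns for that row, which is what makes this breakdown faithful to the model.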
The misconception that matters most: importance is not causation
Here is the single most important sentence on this page. A feature being important to a model does not mean it causes the outcome. Importance is a statement about the model and the data it was trained on, not about the world.
A model can lean heavily on a feature for reasons that have nothing to do with cause and effect:
- The feature may be a proxy for the true cause. Ice-cream sales might be a top predictor of drownings — not because ice cream causes drowning, but because hot weather drives both. A model happily uses the proxy.
- The feature may leak the answer. If "number of reminder letters sent" is a top predictor of loan default, it may be that the bank sends letters because it already suspects default — the feature is a consequence, not a cause.
- Causation may run the other way, or both features may share a hidden common cause, as with the ice cream above.
This is the same lesson as the old statistics adage correlation is not causation, wearing machine-learning clothes. A high importance score means "the model found this feature useful for prediction given the data it saw." Whether intervening on that feature in the real world would change the outcome is a causal question that importance scores cannot answer — it takes an experiment, or careful causal reasoning, to know.
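The ice-cream example can be simulated. Everything here is made up for illustration: the variable names, the slopes, and the assumption that temperature is the only real driver of both quantities. The model predicts drownings well from ice-cream sales, yet when sales are set independently of the weather (an intervention), drownings do not respond.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
temperature = rng.normal(size=n)                        # hidden common cause
ice_cream = 2.0 * temperature + rng.normal(size=n)      # driven by temperature
drownings = 3.0 * temperature + rng.normal(size=n)      # also driven by temperature only

# Observational data: ice cream is a strong predictor of drownings.
model = LinearRegression().fit(ice_cream.reshape(-1, 1), drownings)
r2_observed = model.score(ice_cream.reshape(-1, 1), drownings)

# Intervention: set ice-cream sales by fiat, independent of the weather.
ice_cream_forced = rng.normal(size=n)
drownings_after = 3.0 * temperature + rng.normal(size=n)  # mechanism is unchanged
corr_after = np.corrcoef(ice_cream_forced, drownings_after)[0, 1]

print(f"R^2 of drownings predicted from ice cream (observed): {r2_observed:.2f}")
print(f"correlation after forcing ice-cream sales:            {corr_after:+.2f}")
```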
Do not read importance as a to-do list
The deadliest misuse of feature importance is treating it as advice for action: "color intensity is the most important feature, so let us change color intensity to change the outcome." Importance describes prediction, not intervention. A feature can dominate a model and yet be a useless lever in reality, because changing it would not change the cause it was merely standing in for. To learn what to do, you need causal evidence, not an importance ranking.
When NOT to over-trust importance
- When features are strongly correlated. As we saw, every method distorts under correlation — splitting credit, hiding it, or shuffling it ineffectively. Inspect your correlations before reading too much into any single feature's rank.
- When you have not scaled, for coefficients. Ranking raw coefficients across features on different scales compares millimeters to kilograms. It is meaningless without standardization.
- When the model itself is poor. A model that does not generalize gives importance scores that explain its mistakes. Establish that the model works on held-out data first; only then is "why" a question worth asking.
- When you need causal answers. Importance is the wrong tool for "what should we change?" Reach for an experiment instead.
Real-world applications
Interpretation runs alongside prediction across every serious domain. A credit team must, often by law, tell a rejected applicant which factors drove the decision — straight from model coefficients or importances. A hospital validating a diagnostic model checks that it leans on clinically sensible measurements, not on an artifact like which scanner produced the image. A churn team uses importance to brief the business on why customers leave, turning a model into a strategy. In each case the model's accuracy opened the door, but its interpretability is what let people walk through it.
Your turn
You will train a random forest on the wine dataset and pull out its most important features.
- The data is loaded for you: X, y, and feature_names.
- Fit a RandomForestClassifier(n_estimators=200, random_state=0) and store it in forest.
- Read its feature_importances_ into a variable called importances.
- Find the index of the single most important feature and store the feature's name (a string from feature_names) in top_feature. Hint: importances.argmax() gives the index.
- Build a list top3 of the names of the three most important features, ordered from most to least important.
The hidden tests check that the forest is fitted, that importances has
one value per feature and sums to about 1, that top_feature is the
correct name, and that top3 has the right three names in order.
Check your understanding
Why must you standardize features before comparing the raw coefficients of a linear or logistic regression as "importances"?
- Because unstandardized features crash LogisticRegression
- Because a coefficient's magnitude depends on its feature's units, so unscaled coefficients compare quantities measured on different scales
- Because standardizing makes all coefficients exactly equal
- Because coefficients are otherwise always negative
A colleague adds a column of pure random noise (with many unique values) to the training data and is alarmed that the random forest's impurity-based feature_importances_ gives it a non-trivial score. What is going on?
- The forest is broken and should be reinstalled
- The noise column genuinely causes the target
- Impurity-based importance is biased toward high-cardinality features, which offer many split points and can reduce training impurity by chance
- Random forests cannot handle continuous features
What does permutation importance actually measure for a given feature?
- The coefficient the model assigns to that feature
- How many times the feature appears in the training data
- How much the model's score drops when that feature's values are randomly shuffled, breaking its link to the target
- The correlation between that feature and every other feature
Why is it better to compute permutation importance on a held-out set rather than on the training data?
- Held-out data is always larger than training data
- It is the only set permutation importance can run on
- On the training set an overfit model can look dependent on features it merely memorized, while held-out data measures what actually helps it generalize
- Training data has no labels to shuffle
A churn model ranks "number of support tickets" as its most important feature. A manager concludes: "Let us reduce support tickets and customers will stop churning." Why is this reasoning flawed?
- Because permutation importance is more accurate than impurity importance
- Because the feature should have been scaled first
- Because high importance means the feature predicts churn, not that it causes churn; tickets may be a symptom of an underlying problem, and suppressing them would not fix the cause
- Because random forests cannot be used for churn prediction