Ensembles and Random Forests
The wisdom of crowds, applied to models. One decision tree is clever but unstable; average hundreds of diverse trees and you get one of the most reliable, hardest-to-beat models in all of tabular machine learning.
On the decision trees page we ended on a frustration: a single tree is clever and readable, but unstable. Nudge the training data a little and the whole tree can rearrange itself. That instability is high variance, and high variance means the model is overfitting to the particular sample it happened to see.
This page is the payoff. It turns out the cure for one unstable model is many unstable models. If you train hundreds of slightly different trees and let them vote, their individual errors — which point in random directions — tend to cancel out, while the signal they agree on reinforces. The result, a random forest, is dramatically more accurate and stable than any single tree, and it is the model most experienced practitioners reach for first on tabular data. Understanding why averaging works is one of the most valuable intuitions in classical machine learning.
The intuition: the wisdom of crowds
There is a famous story, reported by Francis Galton in 1907: at a county fair, hundreds of people guessed the weight of an ox. No individual guess had to be exact, yet the average of all the guesses was within a pound of the truth, far closer than the typical individual guess. Each person's error was partly random; averaging cancelled the random part and left the collective signal.
An ensemble applies this to models. Take many models that are each individually mediocre and somewhat independent in their mistakes, then combine their predictions. As long as the models are better than random and their errors are not all identical, the combination tends to beat the typical member, and often beats every one of them. The two requirements are right there: the members must be reasonably good and diverse. A crowd of identical clones tells you nothing new; a crowd of varied, independent guessers is wise.
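To make the "good and diverse" requirement concrete, here is a small simulation sketch. It is an idealized setup, not a real model: the 60% per-model accuracy, the 101 voters, and the fully independent errors are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 101      # odd, so a binary vote cannot tie (assumed value)
n_cases = 10_000    # number of simulated prediction tasks (assumed value)
p_correct = 0.6     # assumed accuracy of each individual model

# correct[i, j] is True when model j gets case i right.
# Every entry is drawn independently: the idealized "diverse crowd".
correct = rng.random((n_cases, n_models)) < p_correct

individual_acc = correct.mean()            # close to p_correct by construction
votes_for_truth = correct.sum(axis=1)      # how many models got each case right
ensemble_acc = (votes_for_truth > n_models // 2).mean()  # majority vote is right

print(f"typical single model:   {individual_acc:.3f}")
print(f"majority vote of {n_models}:   {ensemble_acc:.3f}")
```

Real trees are never this independent, so the gain in practice is smaller than in this simulation. That gap is exactly why the rest of the recipe focuses on making the trees diverse.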
Picture the entire random forest this way: many trees, each trained a little differently, all feeding one aggregation step. The art is entirely in how you make the trees diverse.
Ensemble, bagging, random forest — the nesting dolls
An ensemble is any combination of models. Bagging (bootstrap aggregating) is a specific recipe: train each model on a random resample of the data and average them. A random forest is bagging applied to decision trees, plus one extra trick (random feature subsets at each split) to make the trees even more diverse. So: random forest is a kind of bagging, which is a kind of ensemble.
Trick one: bootstrap samples (bagging)
How do you get many different trees from one dataset? The first idea is bootstrapping: instead of training every tree on the same data, give each tree its own random sample drawn with replacement from the training set. "With replacement" means some rows appear two or three times in a given sample and others not at all. Each tree therefore sees a slightly different slice of the world and grows a slightly different structure.
Each sample is a distorted view of the same data — some rows duplicated, others absent. On average, each bootstrap sample omits about 37% of the original rows (those left-out rows have a nice use: they form a built-in validation set, the out-of-bag sample). Training a tree on each distorted sample is what makes the trees disagree, and disagreement is exactly what we need for averaging to help.
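The 37% figure can be checked directly. The sketch below draws one bootstrap sample of row indices with NumPy and counts how many rows were never picked; the dataset size of 10,000 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000  # assumed dataset size, for illustration only

# One bootstrap sample: n_rows row indices drawn with replacement.
sample = rng.integers(0, n_rows, size=n_rows)

# Rows that never appear in the sample are the out-of-bag rows.
in_bag = np.zeros(n_rows, dtype=bool)
in_bag[sample] = True
oob_fraction = 1.0 - in_bag.mean()

# Probability that one particular row is missed by all n_rows draws.
theory = (1 - 1 / n_rows) ** n_rows

print(f"simulated out-of-bag fraction: {oob_fraction:.3f}")
print(f"(1 - 1/n)^n = {theory:.3f}, 1/e = {np.exp(-1):.3f}")
```

In scikit-learn, passing `oob_score=True` to `RandomForestClassifier` makes the forest score itself on these left-out rows and store the result in `oob_score_`.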
Trick two: random feature subsets
Bootstrapping alone is not quite enough, because strong features tend to dominate. If one feature is highly predictive, every tree will likely split on it first, and the trees end up looking alike — defeating the purpose. The random forest's signature trick fixes this: at each split, the tree may only consider a random subset of the features. Sometimes the dominant feature is not even in the running, forcing the tree to discover useful structure in the others.
The combination — bootstrap rows and random feature subsets — produces a collection of genuinely decorrelated trees. That decorrelation is the secret to the variance reduction we are about to measure.
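One rough way to see the decorrelation is to measure how similar the individual trees' predictions are. The sketch below compares a forest allowed to use every feature at each split (`max_features=None`, which amounts to plain bagging) with the usual `max_features="sqrt"`. The helper `mean_tree_correlation` is a hypothetical name introduced here, and the dataset, split, and 50 trees are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

def mean_tree_correlation(forest, X_eval):
    """Average pairwise correlation between the individual trees' predictions."""
    # One row per tree, one column per evaluation sample.
    preds = np.array([tree.predict(X_eval) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    # Keep only the entries above the diagonal (each pair counted once).
    upper = np.triu_indices_from(corr, k=1)
    return corr[upper].mean()

results = {}
for label, max_features in [("plain bagging (all features)", None),
                            ("random forest (sqrt features)", "sqrt")]:
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=0
    ).fit(X_train, y_train)
    results[label] = mean_tree_correlation(forest, X_test)
    print(f"{label}: mean tree-to-tree correlation = {results[label]:.3f}")
```

You would generally expect the second number to be lower, though the size of the gap depends on the dataset and the random seed.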
Why decorrelation matters more than it sounds
Averaging n models reduces variance, but only fully if their errors are
independent. If all the trees make the same mistakes, averaging them
changes nothing. Random feature subsets deliberately weaken the correlation
between trees so their errors cancel more effectively. This is the single
idea that makes a random forest better than plain bagging.
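The standard formula behind this claim is worth seeing once. It assumes an idealized setting: n trees whose predictions each have the same variance σ² and the same pairwise correlation ρ. The symbols are introduced here for illustration.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} T_i\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
```

The second term shrinks toward zero as n grows, but the first term does not. With ρ = 1 (identical trees) averaging changes nothing, and with ρ = 0 (independent trees) the variance falls all the way to σ²/n. Lowering ρ is the only way to lower the floor, and that is the job of random feature subsets.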
A random forest beats a single tree
Enough theory — let us see it. We will pit one unconstrained (overfit-prone) decision tree against a random forest of many such trees, on the breast cancer dataset, and compare their test accuracy.
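A minimal sketch of this comparison follows. The 70/30 stratified split, `random_state=0`, and `n_estimators=200` are assumptions chosen for illustration, so the exact scores you see may differ from those discussed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# One unconstrained tree: no depth limit, so it can memorize the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest of unconstrained trees, each on its own bootstrap sample.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

for name, model in [("single tree", tree), ("random forest", forest)]:
    print(f"{name:>13}: train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")
```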
Both models hit (or nearly hit) 1.000 on the training data — individually,
the forest's trees overfit just like the lone tree. The difference is on the
test set: the single tree's score is noticeably lower, while the forest's
is markedly higher. By averaging away the idiosyncratic mistakes of each
tree, the forest keeps the genuine signal and discards much of the noise.
This is variance reduction made visible.
The headline result of this chapter
A random forest typically matches or beats a single tree on test data while being far more stable — and it usually needs no scaling and little tuning to get there. That combination of strong accuracy and low effort is why it is such a popular default for tabular problems.
Forests for regression, too
Everything transfers to regression by swapping the vote for an
average. A RandomForestRegressor builds many regression trees and
predicts the mean of their outputs. One pleasant side effect: averaging many
staircase predictions yields a much smoother, less jagged function than a
single regression tree's blocky steps.
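Here is a sketch of the regression version. The dataset is an assumption: scikit-learn's bundled diabetes data is used because it is small and noisy, and the split and `n_estimators=200` are likewise illustrative choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single regression tree grown without any depth limit.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Many such trees; the forest predicts the mean of their outputs.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# For regressors, .score() returns R^2.
tree_r2 = tree.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)

print(f"single tree   test R^2: {tree_r2:.3f}")
print(f"random forest test R^2: {forest_r2:.3f}")
```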
The lone regression tree, grown without limit, overfits and posts a weak test R^2 — sometimes barely above zero. The forest's averaging lifts the test R^2 substantially. Same data, same kind of base model; the only new ingredient is "many trees, averaged."
How many trees? n_estimators and diminishing returns
n_estimators sets how many trees the forest grows. A natural question:
should you crank it as high as possible? Here the intuition is reassuring
and important.
More trees do not make a forest overfit: adding trees only refines the
average, so test accuracy does not degrade the way it can when you deepen a
single tree (small random fluctuations aside). But the benefit flattens
out. The first 50 or 100 trees buy most of the improvement; going from 500
to 1000 barely moves the score while doubling the compute. So n_estimators
trades runtime for diminishing accuracy gains, not for risk.
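A simple way to see the plateau is to train forests of increasing size and record the test accuracy of each. This is a sketch under assumed settings: the breast cancer data, a 70/30 stratified split, and the particular list of sizes are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scores = {}
for n in [1, 5, 25, 100, 300]:
    # Same data and seed each time; only the number of trees changes.
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    scores[n] = forest.score(X_test, y_test)
    print(f"n_estimators={n:>3}: test accuracy = {scores[n]:.3f}")
```

On a small test set the curve will wobble by a sample or two rather than rise perfectly smoothly, so look at the overall shape rather than any single step.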
Watch the accuracy climb quickly from a tiny forest, then plateau. A single
tree (n_estimators=1) is the shaky baseline; by 25–100 trees the score has
essentially settled. The lesson: pick n_estimators large enough to reach
the plateau (often a few hundred) and spend your real tuning effort on
max_depth, max_features, or min_samples_leaf instead.
More trees is not more overfitting
A common confusion: people assume that since a deeper single tree overfits, a forest with more trees must overfit too. Not so: these are different knobs. Adding trees averages over more bootstrap-trained trees and only stabilizes the prediction. What controls a forest's complexity is how deep each individual tree is allowed to grow, not how many trees there are.
A peek at feature importances
Because a forest is built from trees, it can report which features were most
useful for splitting, aggregated across all the trees:
forest.feature_importances_. This is a quick way to see what the model
leaned on.
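A short sketch of reading the attribute: fit a forest, then list the five largest importances next to their feature names. The dataset and `n_estimators=200` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# One non-negative number per feature, averaged over all trees.
importances = forest.feature_importances_

# Indices of the five largest importances, biggest first.
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} {importances[i]:.3f}")

print(f"sum of all importances: {importances.sum():.3f}")
```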
The importances sum to 1 and rank the features by their contribution. Treat this as a hint, not a verdict — these "impurity-based" importances can be misleading (they inflate high-cardinality features and split credit among correlated ones). The honest, model-agnostic way to measure importance, and the caveats around interpreting it, are the subject of the Model Interpretation page. For now, just know the capability exists.
Importance is a hint, not a cause
A high importance means a feature was useful for the forest's splits, not that it causes the outcome or that it would matter to a different model. Resist reading causation into these bars. The Model Interpretation page covers permutation importance and other more trustworthy tools.
When to use a random forest — and when not to
A random forest is an outstanding default when:
- You want strong accuracy with minimal fuss. It works well out of the box on tabular data, needs no feature scaling, tolerates irrelevant features, and rarely overfits catastrophically. It is the model to beat.
- You have a mix of feature types and nonlinear interactions. Trees handle these natively; the forest makes them reliable.
- Stability matters. Where a single tree is brittle, a forest's averaged prediction is steady.
Reach for something else when:
- Interpretability is paramount. You traded the single tree's readable flowchart for hundreds of trees you cannot eyeball. If a regulator needs the exact decision logic, a forest is a poor fit — prefer a shallow single tree or a linear model.
- You need tiny models or millisecond predictions on constrained hardware. Hundreds of trees are larger and slower to evaluate than one model.
- The signal is genuinely linear and smooth. A well-specified linear model may match a forest with far less compute and full interpretability.
- You are chasing the absolute top of a leaderboard. Gradient-boosted trees often edge out random forests on accuracy (at the cost of more careful tuning). Boosting is beyond this foundations course, but it is the natural next step.
Common misconceptions about random forests
- "More trees can overfit the forest." No — more trees stabilize the average. Per-tree depth controls complexity, not the count.
- "A forest is just a bigger decision tree." It is many independent trees whose predictions are combined; there is no single giant tree.
- "Feature importances prove causation." They flag what was useful for splitting, with known biases. Interpret cautiously.
- "Forests need feature scaling." They do not — like single trees, they are invariant to feature scale.
- "A forest is always the best model." It is an excellent default, but smooth-linear problems, interpretability needs, or tight latency budgets can each favor a different model.
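The scaling point is easy to check for yourself. The sketch below trains one forest on raw features and one on standardized features, with the same seed, and measures how often their test predictions agree. The dataset, split, and 100 trees are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)

raw = RandomForestClassifier(n_estimators=100, random_state=0)
raw.fit(X_train, y_train)

scaled = RandomForestClassifier(n_estimators=100, random_state=0)
scaled.fit(scaler.transform(X_train), y_train)

# Fraction of test rows on which the two forests predict the same class.
agreement = np.mean(
    raw.predict(X_test) == scaled.predict(scaler.transform(X_test))
)
print(f"prediction agreement, raw vs scaled: {agreement:.3f}")
```

Agreement should be at or very near 1.0. Any small difference comes from floating-point tie-breaking between equally good splits, not from the forest caring about scale.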
Real-world applications
Random forests (and their boosted cousins) quietly run an enormous share of practical, tabular machine learning:
- Finance and risk. Credit scoring, fraud detection, and insurance pricing, where robustness and good out-of-the-box accuracy matter.
- Healthcare analytics. Predicting readmission risk or disease onset from many mixed clinical measurements.
- Industry and operations. Demand forecasting, predictive maintenance, churn prediction — anywhere there is a spreadsheet-shaped problem.
- Bioinformatics and ecology. Classic strongholds, partly because forests cope gracefully with many features and noisy measurements.
When a data scientist faces an unfamiliar tabular dataset and wants a strong result quickly, a random forest is very often the first thing they try — and frequently the last, because it is so hard to beat without considerable extra effort.
Your turn
Demonstrate the headline result: averaging many trees beats one overfit tree on held-out data.
- Load the data with load_wine(return_X_y=True) into X and y.
- Split into train/test with 30% in the test set, random_state=0, and stratified on y.
- Train a single DecisionTreeClassifier(random_state=0) (no depth limit) on the training set. Store its test accuracy in tree_acc.
- Train a RandomForestClassifier(n_estimators=200, random_state=0) on the training set. Name it forest and store its test accuracy in forest_acc.
The hidden tests check that forest is a 200-tree
RandomForestClassifier, that the forest is accurate (test accuracy above
0.95), and — the key point of the page — that the forest's test accuracy is
at least as high as the single tree's (forest_acc >= tree_acc).
Check your understanding
What is the core reason a random forest usually generalizes better than a single decision tree?
- Each tree in the forest is individually far more accurate than a standalone tree
- The forest grows one enormous tree that is too big to overfit
- Averaging the predictions of many diverse, individually-overfit trees cancels out their independent errors, reducing variance while keeping the signal they agree on
- The forest discards the training data and learns a smooth equation instead
In a random forest, what does training each tree on a bootstrap sample accomplish?
- It guarantees every tree sees the exact same data for consistency
- It permanently removes 37% of the dataset to speed up training
- Each tree is trained on a different random resample (drawn with replacement) of the data, so the trees grow different structures: the diversity that makes averaging effective
- It scales the features so distances are comparable
Besides bootstrapping the rows, a random forest also considers only a random subset of features at each split. Why?
- To make training slower so the model is more thorough
- To ensure every tree uses all features equally
- To decorrelate the trees: if one feature is very strong, restricting the candidates at each split prevents every tree from looking the same, so their errors are more independent
- Because trees cannot handle more than a few features at once
You increase n_estimators from 100 to 1000. What is the most accurate expectation?
- Test accuracy will keep rising sharply, so more is always clearly worth it
- The forest will start to overfit because it now has too many trees
- Accuracy will change very little (it has likely plateaued), while training and prediction take roughly ten times longer: diminishing returns, not added risk
- Accuracy will drop because extra trees add noise to the vote
When is a single decision tree preferable to a random forest?
- When you need the highest possible test accuracy
- When the features are on very different scales and need handling
- When you must be able to read and explain the exact decision logic: a shallow single tree is a literal flowchart, whereas a forest of hundreds of trees is far harder to interpret
- Single trees are always faster to train than any forest