Ensembles and Random Forests

The wisdom of crowds, applied to models. One decision tree is clever but unstable; average hundreds of diverse trees and you get one of the most reliable, hardest-to-beat models in all of tabular machine learning.

On the decision trees page we ended on a frustration: a single tree is clever and readable, but unstable. Nudge the training data a little and the whole tree can rearrange itself. That instability is high variance, and high variance means the model is overfitting to the particular sample it happened to see.

This page is the payoff. It turns out the cure for one unstable model is many unstable models. If you train hundreds of slightly different trees and let them vote, their individual errors, which point in largely random directions, tend to cancel out, while the signal they agree on is reinforced. The result, a random forest, is usually more accurate and far more stable than any single tree, and it is a model many experienced practitioners reach for first on tabular data. Understanding why averaging works is one of the most valuable intuitions in classical machine learning.

The intuition: the wisdom of crowds

There is a famous story, usually credited to Francis Galton: at a livestock fair, hundreds of people guessed the weight of an ox. Individual guesses were scattered widely, but the average of all the guesses landed within about a pound of the truth. Each person's error was partly random; averaging cancelled the random part and left the collective signal.

An ensemble applies this to models. Take many models that are each individually mediocre and somewhat independent in their mistakes, then combine their predictions. As long as the models are better than random and their errors are not all identical, the combination tends to beat a typical member. The two requirements are right there: the members must be reasonably good, and they must be diverse. A crowd of identical clones tells you nothing new; a crowd of varied, independent guessers is wise.
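The "better than random and reasonably independent" condition can be checked with a small simulation. This is only a sketch under idealized assumptions: the voters are fully independent and all share the same accuracy, and the numbers `n_voters`, `p_correct`, and `n_trials` are illustrative choices, not values from this page.

```python
import numpy as np

rng = np.random.default_rng(0)

n_voters = 101      # odd, so a majority vote never ties
p_correct = 0.6     # each voter is only slightly better than a coin flip
n_trials = 10_000   # number of independent yes/no questions

# votes[i, j] is True when voter j answers question i correctly.
votes = rng.random((n_trials, n_voters)) < p_correct

# Accuracy of one voter on their own.
single_acc = votes[:, 0].mean()

# Accuracy of the crowd: correct whenever more than half the voters are correct.
majority_acc = (votes.sum(axis=1) > n_voters / 2).mean()

print(f"single voter accuracy: {single_acc:.3f}")
print(f"majority of {n_voters} voters: {majority_acc:.3f}")
```

Real trees are never fully independent, so a forest gains less than this idealized crowd does. That gap is the reason the rest of the page spends so much effort on making trees diverse.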

That is the entire random forest in one picture: many trees, each trained a little differently, all feeding one aggregation step. The art is entirely in how you make the trees diverse.

Ensemble, bagging, random forest — the nesting dolls

An ensemble is any combination of models. Bagging (bootstrap aggregating) is a specific recipe: train each model on a random resample of the data and average them. A random forest is bagging applied to decision trees, plus one extra trick (random feature subsets at each split) to make the trees even more diverse. So: random forest is a kind of bagging, which is a kind of ensemble.

Trick one: bootstrap samples (bagging)

How do you get many different trees from one dataset? The first idea is bootstrapping: instead of training every tree on the same data, give each tree its own random sample drawn with replacement from the training set. "With replacement" means some rows appear two or three times in a given sample and others not at all. Each tree therefore sees a slightly different slice of the world and grows a slightly different structure.

Each sample is a distorted view of the same data: some rows duplicated, others absent. On average, each bootstrap sample omits about 37% of the original rows, because the chance that a given row is never picked in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.37 as n grows. Those left-out rows have a nice use: they form a built-in validation set, the out-of-bag sample. Training a tree on each distorted sample is what makes the trees disagree, and disagreement is exactly what we need for averaging to help.
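The roughly 37% figure is easy to verify empirically. The sketch below draws a single bootstrap sample of row indices with NumPy and counts the rows that were never picked; the dataset size `n_rows` is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000  # illustrative dataset size

# One bootstrap sample: n_rows row indices drawn with replacement.
sample_idx = rng.integers(0, n_rows, size=n_rows)

# Rows that appear at least once are "in the bag"; the rest are out-of-bag.
in_bag = np.unique(sample_idx)
oob_fraction = 1 - in_bag.size / n_rows

print(f"out-of-bag fraction: {oob_fraction:.3f}")
print(f"limit 1/e:           {np.exp(-1):.3f}")
```

In scikit-learn, passing `oob_score=True` to `RandomForestClassifier` uses these left-out rows to compute an accuracy estimate, stored in `oob_score_` after fitting.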

Trick two: random feature subsets

Bootstrapping alone is not quite enough, because strong features tend to dominate. If one feature is highly predictive, every tree will likely split on it first, and the trees end up looking alike — defeating the purpose. The random forest's signature trick fixes this: at each split, the tree may only consider a random subset of the features. Sometimes the dominant feature is not even in the running, forcing the tree to discover useful structure in the others.

The combination — bootstrap rows and random feature subsets — produces a collection of genuinely decorrelated trees. That decorrelation is the secret to the variance reduction we are about to measure.
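In scikit-learn the feature-subset trick is controlled by the `max_features` parameter. As a rough sketch, the comparison below fits two forests that differ only in that setting: `max_features=None` lets every split see all features (effectively plain bagged trees), while `max_features="sqrt"` restricts each split to a random subset. The dataset, split, and tree count are illustrative assumptions, and on a given dataset the two scores may be close or even reversed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# max_features=None: every split may consider all features (bagging only).
bagged = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)

# max_features="sqrt": each split considers about sqrt(n_features) random features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

for name, model in [("bagged trees", bagged), ("random forest", forest)]:
    model.fit(X_train, y_train)
    print(f"{name:>13}: test accuracy = {model.score(X_test, y_test):.3f}")
```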

Why decorrelation matters more than it sounds

Averaging n models reduces variance, but only fully if their errors are independent. If all the trees make the same mistakes, averaging them changes nothing. Random feature subsets deliberately weaken the correlation between trees so their errors cancel more effectively. This is the single idea that makes a random forest better than plain bagging.
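The claim can be made precise with a standard result. The notation here is introduced only for illustration: assume n trees with predictions T_1, ..., T_n, each with the same variance σ² and the same pairwise correlation ρ. Then the variance of their average is:

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} T_i\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{n}\,\sigma^{2}
```

The second term shrinks toward zero as n grows, but the first term does not. With ρ = 1 (identical trees) averaging gives no reduction at all; with ρ = 0 (independent trees) the variance falls as σ²/n. Lowering ρ is therefore the only way to lower the floor that adding more trees cannot get beneath.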

A random forest beats a single tree

Enough theory — let us see it. We will pit one unconstrained (overfit-prone) decision tree against a random forest of many such trees, on the breast cancer dataset, and compare their test accuracy.
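A minimal version of that comparison might look like the sketch below. The split size, `random_state` values, and `n_estimators=200` are assumptions chosen for illustration, so the exact scores will differ from run to run if you change them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# One unconstrained tree: no depth limit, so it can memorize the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest of many such trees, each on its own bootstrap sample.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

for name, model in [("single tree", tree), ("random forest", forest)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name:>13}: train={train_acc:.3f}  test={test_acc:.3f}")
```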

Both models hit (or nearly hit) 1.000 on the training data: individually, the forest's trees overfit just like the lone tree. The difference is on the test set, where the single tree's score is noticeably lower than the forest's. By averaging away the idiosyncratic mistakes of each tree, the forest keeps the genuine signal and discards much of the noise. This is variance reduction made visible.

The headline result of this chapter

A random forest typically matches or beats a single tree on test data while being far more stable — and it usually needs no scaling and little tuning to get there. That combination of strong accuracy and low effort is why it is such a popular default for tabular problems.

Forests for regression, too

Everything transfers to regression by swapping the vote for an average. A RandomForestRegressor builds many regression trees and predicts the mean of their outputs. One pleasant side effect: averaging many staircase predictions yields a much smoother, less jagged function than a single regression tree's blocky steps.
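As a sketch of the regression case, the example below uses a synthetic noisy sine curve. That dataset is an assumption made for illustration, not the one used on this page. It also checks directly that the forest's prediction is the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative): a sine wave plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print(f"single tree test R^2: {tree.score(X_test, y_test):.3f}")
print(f"forest test R^2:      {forest.score(X_test, y_test):.3f}")

# The forest's output is the average of its trees' outputs.
per_tree = np.stack([t.predict(X_test) for t in forest.estimators_])
matches_mean = np.allclose(per_tree.mean(axis=0), forest.predict(X_test))
print("forest prediction equals mean of trees:", matches_mean)
```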

The lone regression tree, grown without limit, overfits and posts a weak test R^2 — sometimes barely above zero. The forest's averaging lifts the test R^2 substantially. Same data, same kind of base model; the only new ingredient is "many trees, averaged."

How many trees? `n_estimators` and diminishing returns

n_estimators sets how many trees the forest grows. A natural question: should you crank it as high as possible? Here the intuition is reassuring and important.

Adding trees essentially never hurts accuracy: each extra tree only refines the average, and it cannot cause overfitting the way deepening a single tree does. But the benefit flattens out. The first 50 or 100 trees buy most of the improvement; going from 500 to 1000 barely moves the score while doubling the compute. So n_estimators trades runtime for diminishing accuracy gains, not for risk.
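One way to see the plateau is to sweep `n_estimators` and record the test accuracy at each size. The sketch below does this on the breast cancer data; the list of sizes and the split are illustrative assumptions, and small forests in particular will give noisy scores that depend on `random_state`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Test accuracy for forests of increasing size.
scores = {}
for n in [1, 5, 25, 100, 300]:
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    scores[n] = forest.score(X_test, y_test)

for n, acc in scores.items():
    print(f"n_estimators={n:>3}: test accuracy = {acc:.3f}")
```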

Watch the accuracy climb quickly from a tiny forest, then plateau. A single tree (n_estimators=1) is the shaky baseline; by 25–100 trees the score has essentially settled. The lesson: pick n_estimators large enough to reach the plateau (often a few hundred) and spend your real tuning effort on max_depth, max_features, or min_samples_leaf instead.

More trees is not more overfitting

A common confusion: people assume that since a deeper single tree overfits, a forest with more trees must overfit too. Not so: these are different knobs. Adding trees averages over more bootstrap samples and only stabilizes the prediction. What controls a forest's complexity is how deep each individual tree is allowed to grow, not how many trees there are.

A peek at feature importances

Because a forest is built from trees, it can report which features were most useful for splitting, aggregated across all the trees: forest.feature_importances_. This is a quick way to see what the model leaned on.
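A short sketch of reading the attribute follows. It fits on the full breast cancer dataset for simplicity and prints the five highest-ranked features; the dataset and the choice of five are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# One impurity-based importance per feature, normalized to sum to 1.
importances = forest.feature_importances_

# Indices of features sorted from most to least important.
order = np.argsort(importances)[::-1]

for i in order[:5]:
    print(f"{data.feature_names[i]:<25} {importances[i]:.3f}")
```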

The importances sum to 1 and rank the features by their contribution. Treat this as a hint, not a verdict — these "impurity-based" importances can be misleading (they inflate high-cardinality features and split credit among correlated ones). The honest, model-agnostic way to measure importance, and the caveats around interpreting it, are the subject of the Model Interpretation page. For now, just know the capability exists.

Importance is a hint, not a cause

A high importance means a feature was useful for the forest's splits, not that it causes the outcome or that it would matter to a different model. Resist reading causation into these bars. The Model Interpretation page covers permutation importance and other more trustworthy tools.

When to use a random forest — and when not to

A random forest is an outstanding default when:

You want strong accuracy with minimal fuss. It works well out of the box on tabular data, needs no feature scaling, tolerates irrelevant features, and rarely overfits catastrophically. It is the model to beat.
You have a mix of feature types and nonlinear interactions. Trees handle these natively; the forest makes them reliable.
Stability matters. Where a single tree is brittle, a forest's averaged prediction is steady.

Reach for something else when:

Interpretability is paramount. You traded the single tree's readable flowchart for hundreds of trees you cannot eyeball. If a regulator needs the exact decision logic, a forest is a poor fit — prefer a shallow single tree or a linear model.
You need tiny models or millisecond predictions on constrained hardware. Hundreds of trees are larger and slower to evaluate than one model.
The signal is genuinely linear and smooth. A well-specified linear model may match a forest with far less compute and full interpretability.
You are chasing the absolute top of a leaderboard. Gradient-boosted trees often edge out random forests on accuracy (at the cost of more careful tuning). Boosting is beyond this foundations course, but it is the natural next step.

Common misconceptions about random forests

"More trees can overfit the forest." No — more trees stabilize the average. Per-tree depth controls complexity, not the count.
"A forest is just a bigger decision tree." It is many independent trees whose predictions are combined; there is no single giant tree.
"Feature importances prove causation." They flag what was useful for splitting, with known biases. Interpret cautiously.
"Forests need feature scaling." They do not — like single trees, they are invariant to feature scale.
"A forest is always the best model." It is an excellent default, but smooth-linear problems, interpretability needs, or tight latency budgets can each favor a different model.

Real-world applications

Random forests (and their boosted cousins) quietly run an enormous share of practical, tabular machine learning:

Finance and risk. Credit scoring, fraud detection, and insurance pricing, where robustness and good out-of-the-box accuracy matter.
Healthcare analytics. Predicting readmission risk or disease onset from many mixed clinical measurements.
Industry and operations. Demand forecasting, predictive maintenance, churn prediction — anywhere there is a spreadsheet-shaped problem.
Bioinformatics and ecology. Classic strongholds, partly because forests cope gracefully with many features and noisy measurements.

When a data scientist faces an unfamiliar tabular dataset and wants a strong result quickly, a random forest is very often the first thing they try — and frequently the last, because it is so hard to beat without considerable extra effort.

Your turn

Demonstrate the headline result: averaging many trees beats one overfit tree on held-out data.

Load the data with load_wine(return_X_y=True) into X and y.
Split into train/test with 30% in the test set, random_state=0, and stratified on y.
Train a single DecisionTreeClassifier(random_state=0) (no depth limit) on the training set. Store its test accuracy in tree_acc.
Train a RandomForestClassifier(n_estimators=200, random_state=0) on the training set. Name it forest and store its test accuracy in forest_acc.

The hidden tests check that forest is a 200-tree RandomForestClassifier, that the forest is accurate (test accuracy above 0.95), and — the key point of the page — that the forest's test accuracy is at least as high as the single tree's (forest_acc >= tree_acc).

Check your understanding

Question (select one)

What is the core reason a random forest usually generalizes better than a single decision tree?

Each tree in the forest is individually far more accurate than a standalone tree

The forest grows one enormous tree that is too big to overfit

Averaging the predictions of many diverse, individually-overfit trees cancels out their independent errors, reducing variance while keeping the signal they agree on

The forest discards the training data and learns a smooth equation instead

Question (select one)

In a random forest, what does training each tree on a bootstrap sample accomplish?

It guarantees every tree sees the exact same data for consistency

It permanently removes 37% of the dataset to speed up training

Each tree is trained on a different random resample (drawn with replacement) of the data, so the trees grow different structures — the diversity that makes averaging effective

It scales the features so distances are comparable

Question (select one)

Besides bootstrapping the rows, a random forest also considers only a random subset of features at each split. Why?

To make training slower so the model is more thorough

To ensure every tree uses all features equally

To decorrelate the trees — if one feature is very strong, restricting the candidates at each split prevents every tree from looking the same, so their errors are more independent

Because trees cannot handle more than a few features at once

Question (select one)

You increase n_estimators from 100 to 1000. What is the most accurate expectation?

Test accuracy will keep rising sharply, so more is always clearly worth it

The forest will start to overfit because it now has too many trees

Accuracy will change very little (it has likely plateaued), while training and prediction take roughly ten times longer — diminishing returns, not added risk

Accuracy will drop because extra trees add noise to the vote

Question (select one)

When is a single decision tree preferable to a random forest?

When you need the highest possible test accuracy

When the features are on very different scales and need handling

When you must be able to read and explain the exact decision logic — a shallow single tree is a literal flowchart, whereas a forest of hundreds of trees is far harder to interpret

Single trees are always faster to train than any forest
