Data Preprocessing and Scaling
Why many models need clean, comparably-scaled numbers — and the one rule about scaling that, if you break it, quietly inflates every score you report.
A machine learning model does not see your data the way you do. Where you see "age in years" and "income in dollars," the model sees two columns of numbers and, for many algorithms, silently assumes they live on the same scale. They almost never do. Age runs from roughly 0 to 100; income runs from thousands to millions. Left unscaled, that mismatch can wreck a model that would otherwise work beautifully.
This page is about preprocessing: the unglamorous step of turning raw numbers into numbers a model can actually use. We focus on the most common and most important transformation — feature scaling — and on the single rule that separates an honest pipeline from a leaky one.
The problem scaling solves
Suppose you are predicting whether a loan will default, and two of your features are the applicant's age (say, 18–80) and their annual income (say, 20,000–500,000). To a human these are obviously different kinds of quantities. To many models they are just numbers, and the one with the bigger numbers shouts louder.
Why does that happen? It comes down to how the model measures things.
- Distance-based models like k-nearest neighbors compute how "far apart" two examples are by, in effect, summing squared differences across features. A difference of 50,000 in income utterly dwarfs a difference of 30 in age. The age feature might as well not exist — the distance is decided almost entirely by income, simply because its numbers are bigger.
- Gradient-based and regularized models like logistic regression learn a weight for each feature and penalize large weights. A feature measured in tiny units needs a large weight to have any effect, and a feature in huge units needs a tiny one. That imbalance makes the optimizer's job harder and makes a one-size-fits-all penalty unfair across features.
Scaling fixes both problems by putting every feature on a comparable footing, so the model judges features by their information, not their units.
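To make the "bigger numbers shout louder" point concrete, here is a minimal sketch using three made-up applicants. The ages, incomes, and the per-feature standard deviations used for rescaling are all hypothetical values chosen for illustration, not figures from any dataset.

```python
import numpy as np

# Three hypothetical applicants: [age in years, annual income in dollars].
a = np.array([30.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # 30 years older, nearly the same income
c = np.array([31.0, 52_000.0])   # nearly the same age, slightly higher income

def euclidean(u, v):
    """Square root of the summed squared differences across features."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Distances on the raw numbers: income differences swamp age differences.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Assumed (made-up) per-feature standard deviations, used only to rescale.
assumed_std = np.array([15.0, 100_000.0])
scaled_ab = euclidean(a / assumed_std, b / assumed_std)
scaled_ac = euclidean(a / assumed_std, c / assumed_std)

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {scaled_ab:.3f}, A-C = {scaled_ac:.3f}")
```

On the raw numbers, B looks closer to A than C does, even though B is 30 years older, because a 500-dollar income gap is numerically smaller than a 2,000-dollar one and age barely registers. After rescaling, the ordering flips and C becomes the nearer neighbor.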
Scaling changes the numbers, not the meaning
Scaling is a monotonic, reversible rescaling of each column. It does not throw away information or distort the ordering of values within a feature — a taller person is still taller afterward. It only changes the units so that no single feature dominates purely because it happens to be measured in large numbers.
StandardScaler: turning features into z-scores
The workhorse of scaling is StandardScaler. For each feature (each
column) independently, it subtracts that column's mean and divides by
its standard deviation:
z = (x - μ) / σ, where μ is the column's mean and σ is its standard deviation.
If you have run a hypothesis test, you have seen this exact formula — it is
the z-score. After the transform, each feature has a mean of about 0
and a standard deviation of about 1. A value of +2 now means "two
standard deviations above this feature's average," and that sentence means
the same thing whether the feature was age or income.
Let us see it on real data. The wine dataset has 13 chemical measurements on wildly different scales — some near 0.1, others in the hundreds.
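A minimal sketch of that experiment might look like the following. It assumes scikit-learn's built-in load_wine dataset; the variable names are illustrative. No model is trained or evaluated here, so all rows are scaled together purely to inspect the effect on the columns.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)   # 178 rows, 13 numeric features

print("before: means", np.round(X.mean(axis=0), 2))
print("before: stds ", np.round(X.std(axis=0), 2))

scaler = StandardScaler()
scaler.fit(X)                   # learn each column's mean and standard deviation
X_scaled = scaler.transform(X)  # apply them: (x - mean) / std, column by column

print("after:  means", np.round(X_scaled.mean(axis=0), 2))
print("after:  stds ", np.round(X_scaled.std(axis=0), 2))
```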
Before scaling, the column means and standard deviations are all over the map. After scaling, every mean is essentially 0 and every standard deviation is essentially 1. The columns are now directly comparable.
Notice the two-step shape: scaler.fit(...) learns the mean and
standard deviation of each column, and scaler.transform(...) applies
them. fit_transform just does both in one call. That fit/transform
split is not a stylistic detail — it is the whole game, as we are about to
see.
fit learns, transform applies
fit is where a transformer looks at data and remembers something (here,
each column's mean and standard deviation). transform uses what was
remembered to change data. Every scikit-learn transformer follows this
pattern, and keeping the two mentally separate is the key to avoiding the
mistake in the next section.
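One way to see the split is to inspect what fit stored. The sketch below uses a tiny made-up array; mean_ and scale_ are the attributes StandardScaler sets during fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A tiny made-up dataset: two columns on very different scales.
X_small = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])

scaler = StandardScaler()
scaler.fit(X_small)   # learning step: nothing is transformed yet

print(scaler.mean_)   # per-column means remembered by fit
print(scaler.scale_)  # per-column standard deviations remembered by fit

# transform applies the remembered statistics; it does not recompute them.
manual = (X_small - scaler.mean_) / scaler.scale_
print(np.allclose(scaler.transform(X_small), manual))
```

Because transform only reads the stored statistics, the same fitted scaler can later be applied to rows it never saw during fit.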
Why it matters: the same model, with and without scaling
Talk is cheap. Let us prove that scaling can change a model's accuracy. We will train a k-nearest-neighbors classifier on the wine data twice — once on the raw features, once on scaled features — using the exact same split so the comparison is fair.
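A sketch of that comparison, under some assumptions: the split size, random_state, and n_neighbors=5 are arbitrary choices for illustration, and exact accuracies will vary with them.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Model 1: KNN on the raw, unscaled features.
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc_raw = knn_raw.score(X_test, y_test)

# Model 2: same split, but features standardized with training statistics.
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier(n_neighbors=5).fit(
    scaler.transform(X_train), y_train
)
acc_scaled = knn_scaled.score(scaler.transform(X_test), y_test)

print(f"raw accuracy:    {acc_raw:.3f}")
print(f"scaled accuracy: {acc_scaled:.3f}")
```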
The scaled version is dramatically better. On the raw data, the single
feature with the largest numbers (here, proline, which ranges from a few
hundred to well over a thousand) dominates the distance calculation, so KNN is effectively
classifying wine by one chemical measurement and ignoring the other
twelve. After scaling, all thirteen measurements get a fair say, and
accuracy jumps. For a distance-based model, scaling is not optional — it is
the difference between a model that works and one that barely tries.
Which models care about scale, and which do not
This is one of the most useful mental models in classical machine learning, so it is worth committing to memory.
Scale-sensitive models — scaling usually helps, sometimes a lot:
- k-nearest neighbors and support vector machines measure distances between points, so a feature with big numbers dominates unless scaled.
- Logistic and linear regression with regularization (the default in scikit-learn's LogisticRegression) penalize the size of the weights; that penalty is only fair if features share a scale. Gradient descent also converges faster on scaled inputs.
- K-Means clustering and PCA are built on distances and variances, so they are highly scale-sensitive too. (More on these in the clustering pages.)
Scale-invariant models — scaling changes nothing meaningful:
- Decision trees and everything built from them (random forests, gradient boosting). A tree splits a feature with a question like "is income greater than 50,000?" Multiplying that feature by a thousand just changes the threshold to 50,000,000; the order of the values, and therefore every possible split, is identical. Trees care only about ordering, never about magnitude.
Why trees genuinely do not care
A decision tree asks yes/no questions of the form "is this feature above some threshold?" Any scaling that preserves the order of values (which standardizing does — it is just subtract-and-divide by positive numbers) leaves every possible threshold split unchanged. That is why you can hand a random forest raw, wildly-scaled features and lose nothing. It is also why the wrong-vs-right leakage rule below matters far more for KNN and logistic regression than for forests.
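The invariance can be checked empirically. The sketch below trains the same decision tree on raw and standardized wine features and compares predictions; the split and random_state values are arbitrary assumptions. The predictions are expected to agree on essentially every test row, with floating-point tie-breaking the only plausible source of a difference.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)

# Same tree settings, trained once on raw and once on scaled features.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

agreement = np.mean(pred_raw == pred_scaled)
print(f"fraction of identical predictions: {agreement:.3f}")
```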
The leakage rule: fit on TRAIN only
Here is the most important idea on this page, and the one beginners get
wrong most often. The scaler learns something from data when you fit
it: each column's mean and standard deviation. Those statistics must come
from the training set only. If you compute them using the test set too,
information about the test set has leaked into your training process, and
your test score is no longer an honest estimate of performance on truly
unseen data.
Think back to the train/test split page: the whole point of a test set is that the model knows nothing about it until evaluation. The moment your scaler peeks at the test set to compute a mean, that promise is broken.
The wrong way (leaks)
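A minimal sketch of the leaky pattern, assuming the wine data and an arbitrary split:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# WRONG: the scaler is fit on every row, including the rows that are
# about to become the test set.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Splitting afterwards cannot undo that: the test rows already
# influenced the mean and standard deviation used for scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42, stratify=y
)
```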
This looks innocent and runs without error, which is exactly why it is dangerous. The scaler's mean and standard deviation were computed across all rows — including the test rows. The test set helped decide how the training data gets scaled. That is data leakage: the test set is no longer truly held out.
The right way (no leak)
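A minimal sketch of the leak-free pattern, again assuming the wine data and an arbitrary split. It prints the overall mean of the scaled training set and of the scaled test set.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Split first, on the raw data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(f"train mean: {X_train_scaled.mean():.4f}")
print(f"test mean:  {X_test_scaled.mean():.4f}")
```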
Read those two printouts carefully, because they capture the whole idea. The training mean lands almost exactly at 0 — of course it does, we subtracted the training mean from the training data. The test mean is not exactly 0, and that is correct. We deliberately scaled the test set using the training set's mean and standard deviation, treating it as genuinely new data we are seeing for the first time. In production you would never have a fresh batch's statistics in advance; you only ever have what you learned at training time. Doing it this way means your test score faithfully simulates that reality.
Fit the scaler on train only — always
The rule has no exceptions for ordinary tabular modeling: fit every
preprocessing step on the training set, then transform both the training
and test sets with it. Calling fit_transform on the full dataset before
splitting is the classic leakage bug. It inflates your test score, so you
ship a model that looks great in evaluation and underperforms in
production. The gap is often small enough to miss and large enough to
matter.
A test set that is not perfectly centered is a good sign
If, after scaling, your test set has mean exactly 0 and standard deviation exactly 1, you almost certainly fit the scaler on the test data — a leak. A correctly scaled test set is close to centered (because it is drawn from the same distribution) but not exactly centered. Slight imperfection here is evidence you did it right.
Why the leak inflates your score
It is worth being concrete about the harm. When the scaler sees the test
set, the scaled training data is subtly shaped by the distribution of the
test data. The model then trains on inputs that already "know" a little
about the examples it will be graded on. The grade comes out a touch too
high. In a one-off split the effect can be small, but it compounds badly
inside cross-validation (covered on its own page), where the same leak
happens in every fold and the optimism accumulates. The fix — preprocessing
that is re-fit inside each split — is exactly what Pipeline automates,
which is the subject of a later page.
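As a brief preview only, and assuming the wine data with a KNN classifier as elsewhere on this page, the idea can be sketched with scikit-learn's make_pipeline and cross_val_score. The choice of five folds and n_neighbors=5 is arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The scaler lives inside the pipeline, so within each fold it is
# re-fit on that fold's training portion and only applied to the rest.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(model, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```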
MinMaxScaler: a brief alternative
StandardScaler centers and standardizes. Sometimes you instead want every
feature squeezed into a fixed range, usually [0, 1]. That is
MinMaxScaler: for each column it subtracts the minimum and divides by the
range (max minus min).
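A minimal sketch on the wine training set, assuming the same kind of arbitrary split used earlier. It prints each scaled training column's minimum and maximum.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = MinMaxScaler()                      # default feature_range is (0, 1)
X_train_mm = scaler.fit_transform(X_train)   # learn each column's min and max on train
X_test_mm = scaler.transform(X_test)         # reuse the training min and max

print("train column minimums:", np.round(X_train_mm.min(axis=0), 3))
print("train column maximums:", np.round(X_train_mm.max(axis=0), 3))
```

Note that the bounds hold for the data the scaler was fit on. Test rows transformed with the training minimum and maximum can land slightly outside the interval if they contain values more extreme than anything seen in training.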
Every column now runs from exactly 0 (its smallest value) to exactly 1 (its largest). When should you prefer it?
- Reach for MinMaxScaler when you need a bounded range, for example when feeding pixel intensities in [0, 1], or when downstream code assumes non-negative inputs in a fixed interval.
- Reach for StandardScaler as the sensible default for most models, especially when features are roughly bell-shaped, and when you do not need a hard boundary.
MinMaxScaler is sensitive to outliers
Because MinMaxScaler divides by max - min, a single extreme value
stretches the range and crams every other value into a tiny sliver near 0.
StandardScaler is usually somewhat steadier, though its mean and standard
deviation are also pulled by extreme values. RobustScaler, which uses
the median and interquartile range, is steadier still. The same fit-on-
train-only rule applies to all of them.
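The effect is easy to see on a toy column. The five values below are made up, with one deliberate outlier; the sketch runs the three scalers named above on the same data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One made-up column with a single extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

results = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(x).ravel()
    print(f"{name:>14}: {np.round(results[name], 3)}")
```

With MinMaxScaler the four ordinary values end up within a few thousandths of 0. In a sample this small, StandardScaler also compresses them, because one outlier among five values drags both the mean and the standard deviation. RobustScaler keeps the ordinary values spread out, since the median and interquartile range barely move.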
When NOT to scale, and other misconceptions
Scaling is not a universal good you should sprinkle everywhere. A few honest caveats:
- Tree-based models gain nothing. As shown above, decision trees, random forests, and gradient boosting are invariant to monotonic rescaling. Scaling first is harmless but pointless, and it adds a step that can hide bugs. If your only model is a random forest, you can skip scaling entirely.
- Already-comparable features may not need it. If every feature is, say, a probability in [0, 1], or all measured in the same unit on the same scale, scaling buys you little.
- Scaling is not cleaning. It does not fill missing values, fix typos, remove duplicates, or handle outliers. Those are separate jobs. A NaN will sail straight through a scaler and crash your model later.
- Scaling does not make a bad feature good. It changes units, not information. If a feature is irrelevant, a perfectly scaled version of it is still irrelevant.
Common misconception: 'scaling improves every model'
It improves scale-sensitive models (KNN, SVM, regularized linear models, K-Means, PCA) and does nothing for tree-based models. Knowing which camp your model is in tells you whether scaling is essential, optional, or simply noise in your code. It never hurts accuracy for scale-invariant models — it just does not help.
Real-world applications
Scaling shows up the moment features mix units, which is almost always:
- Credit scoring. Age, income, account balances, and number of accounts live on completely different scales; a regularized logistic model needs them standardized to weigh them fairly.
- Medicine. A diagnostic model might combine resting heart rate (tens), cholesterol (hundreds), and a binary smoking flag (0/1). Without scaling, a distance-based or regularized model is dominated by whichever happens to have larger numbers.
- Recommenders and search that rank by similarity (distance) between feature vectors rely on scaling so that one high-magnitude feature does not silently define "similar."
In every case the modeling step is short; the preprocessing is where the care goes, and where leaks creep in.
Your turn
The breast cancer dataset has 30 features on very different scales. Scale them without leaking the test set.
- The split is already done for you: X_train, X_test, y_train, y_test.
- Create a StandardScaler and fit it on X_train only.
- Use it to build X_train_scaled (transform X_train) and X_test_scaled (transform X_test). Apply the same fitted scaler to both; do not fit it again on the test set.
- Store the mean of X_train_scaled in train_mean.
The hidden tests check that the scaler was fit on the training set (so the training mean is ~0), that the test set was transformed with the same scaler (so its mean is close to but not exactly 0), and that the shapes are unchanged.
Check your understanding
What does StandardScaler do to each feature?
- It maps every value into the fixed range [0, 1]
- It subtracts the feature's mean and divides by its standard deviation, so the feature ends up with mean ~0 and standard deviation ~1
- It removes outliers from the feature
- It converts categorical text into numbers
You scale the entire dataset with fit_transform, then call train_test_split. What is wrong with this?
- Nothing: scaling order does not matter
- The features will not be centered correctly
- The scaler's mean and standard deviation are computed using the test rows too, so information from the test set leaks into training and inflates the test score
- It will raise an error at fit time
Which model is essentially unaffected by whether you scale the features first?
- k-nearest neighbors
- Logistic regression with regularization
- A random forest
- A support vector machine with an RBF kernel
After correctly scaling (fit on train, transform both), you check the means. What should you expect?
- Both the training and test means are exactly 0
- The training mean is ~0, while the test mean is close to but not exactly 0
- Both means are far from 0
- The test mean is exactly 0 but the training mean is not
When would you prefer MinMaxScaler over StandardScaler?
- When the data contains extreme outliers you want to keep
- When the model is a decision tree
- When you need every feature bounded to a fixed interval such as [0, 1]
- When you want to remove the mean from each feature