Pipelines and ColumnTransformer
How to bundle preprocessing and a model into one object that cannot leak: the single most important engineering habit in scikit-learn, and the one that makes cross-validation honest.
By now you have met two preprocessing steps — scaling numbers and encoding
categories — and one ironclad rule: fit preprocessing on the training set
only, then apply it to everything else. You have also seen how easy it is
to break that rule by accident. A misplaced fit_transform, a pd.get_dummies
called separately on train and test, a scaler quietly fit on all the data —
each leaks the test set and inflates your score.
This page introduces the tools that make doing it right easier than doing
it wrong: Pipeline, which chains preprocessing and a model into a single
estimator, and ColumnTransformer, which applies different preprocessing to
different columns. Together they turn a fragile, multi-step ritual into one
object you fit once. This is not a convenience feature — it is how
professional scikit-learn code stays correct.
The problem: preprocessing and modeling are easy to desynchronize
Without pipelines, a realistic workflow looks like a chain of manual steps you must perform in the right order, on the right data, every time:
1. Split into train and test.
2. Fit a scaler on the training features, transform train, transform test.
3. Fit an encoder on the training categories, transform train, transform test.
4. Glue the scaled numbers and encoded categories back together.
5. Fit the model on the training matrix.
6. Repeat the transforms from steps 2–4 (without re-fitting anything) on any new data before predicting.
Every one of those steps is a chance to leak the test set (fit on the wrong data), to forget a transform, or to apply transforms in a different order at predict time than at training time. And it gets worse: when you later use cross-validation (its own page), the data is split into folds many times, and the preprocessing must be re-fit fresh inside every fold. Doing that by hand is so tedious that people skip it — and silently leak.
Manual preprocessing is where leaks live
The bug is rarely in the model. It is almost always in the preprocessing: something was fit on data it should not have seen, or the test set was transformed with statistics it should not have known. The more manual steps between your raw data and your model, the more places a leak can hide.
Pipeline: one estimator, one fit, no leaks
A Pipeline chains a sequence of steps — any number of transformers
followed by a final estimator (a model) — into a single object that behaves
exactly like a model. You build it from a list of (name, step) pairs:
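    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])

When you call .fit(X_train, y_train) on this pipeline, it does the right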
thing automatically: it fits the scaler on X_train, transforms X_train,
and feeds the result to the model's fit. When you call .predict(X_test),
it transforms X_test with the already-fitted scaler (no re-fitting) and
passes that to the model's predict. The fit-on-train / transform-both rule
is enforced by construction — you cannot accidentally fit a step on the test
set, because you never touch the steps individually.
Notice the asymmetry between those two calls: during training the scaler both fits and transforms; during prediction it only transforms, reusing what it learned. That is precisely the discipline you would have to remember by hand, and the pipeline never forgets it.
Here is the simplest possible pipeline, end to end:
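A minimal sketch, assuming scikit-learn's built-in breast-cancer dataset as a stand-in. The dataset choice, the split settings, and the variable names are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: an all-numeric binary classification problem.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# One fit call: the scaler is fit on X_train, then the model is fit
# on the scaled X_train.
pipe.fit(X_train, y_train)

# One score call: X_test is transformed with the already-fitted scaler,
# then passed to the model.
test_acc = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```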
The scaling happens inside the pipeline, fit on the training data only, and
applied consistently to the test data — and you never wrote a single
scaler.fit or scaler.transform call. The leak is impossible to commit
here, and the code is shorter to boot.
A pipeline IS an estimator
Anywhere scikit-learn expects a model — train_test_split workflows,
cross_val_score, GridSearchCV — you can hand it a whole Pipeline. It
exposes the same .fit, .predict, and .score methods. From the outside
it is a model; on the inside it is your entire preprocessing-plus-model
recipe. This is the key that makes leak-free cross-validation effortless,
as the next section explains.
make_pipeline: the same thing, names for free
If you do not care to name the steps, make_pipeline builds the identical
object and names each step after its class automatically:
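A short sketch of the same two-step pipeline built with make_pipeline; the variable names are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# No names supplied: each step is named after its lowercased class name.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']
```

The generated names are what you would use later to reach a step, for example `pipe.named_steps["standardscaler"]`.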
Why the pipeline prevents leakage in cross-validation
This is the reason pipelines matter most, so it deserves its own moment. Cross-validation (covered fully on its own page) repeatedly splits the data into a training portion and a held-out portion, scoring on each. If your scaler was fit once on the whole dataset before cross-validation, then in every fold the "held-out" portion was already seen by the scaler — a leak, repeated in every fold, quietly inflating the average.
When you cross-validate a pipeline, scikit-learn re-fits the entire pipeline — scaler, encoder, and all — separately inside each fold, using only that fold's training portion. The held-out portion of each fold is never seen by any preprocessing step until it is scored. The leak becomes structurally impossible.
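The two approaches can be put side by side. This is a sketch on synthetic data from make_classification (an assumption for illustration), so the specific scores mean nothing beyond the demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, invented for this example.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Leaky: the scaler sees every row, including each fold's held-out rows.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# Leak-free: the whole pipeline is re-fit on each fold's training portion.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
honest_scores = cross_val_score(pipe, X, y, cv=5)

print(f"Pre-scaled data: {leaky_scores.mean():.3f}")
print(f"Pipeline:        {honest_scores.mean():.3f}")
```

With plain standardization on a dataset like this the two averages may be very close. The point is which number you are entitled to trust, and the gap can grow with preprocessing that learns more from the data.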
Hand cross_val_score a bare model with pre-scaled data and you leak; hand
it a pipeline and you do not. Same number of lines, opposite correctness.
This is the single strongest argument for building pipelines as a default
habit rather than an afterthought.
Cross-validating without a pipeline silently leaks
If you scale or encode the full dataset and then run cross-validation on
the transformed data, every fold's validation rows already influenced the
preprocessing. The reported score is optimistic. Wrapping preprocessing and
model in a Pipeline and cross-validating that is the standard,
leak-proof way — and it is why this page comes before the cross-validation
page in spirit even when you read them in the other order.
ColumnTransformer: different transforms for different columns
Real datasets are not all numeric or all categorical — they are mixed.
You want to StandardScaler the numeric columns and OneHotEncoder the
categorical ones, in the same dataset, at the same time. A single Pipeline
step applies one transform to all the columns it receives, which is not
what you want here.
ColumnTransformer solves this. You give it a list of
(name, transformer, columns) triples, and it routes each group of columns
to its own transformer, then concatenates the results side by side.
Let us build a realistic mixed dataset inline and assemble the full thing:
a ColumnTransformer for preprocessing feeding a LogisticRegression, all
wrapped in one Pipeline.
One pipe.fit(X_train, y_train) call: it scaled age and income, one-hot
encoded city and plan, glued the columns together, and trained the
classifier — each preprocessing step fit on the training data only. You
handed it a raw mixed DataFrame and got a trained model, with leakage ruled
out by construction.
Seeing the columns the model actually receives
It is worth peeking at what comes out of the ColumnTransformer, because
the model never sees your original columns — it sees the transformed ones.
The two numeric columns stay as two columns (now scaled), while city and
plan expand into one 0/1 column per category. The num__ and cat__
prefixes tell you which transformer produced each column.
remainder: what happens to columns you did not list
By default, ColumnTransformer drops any column you did not assign to a
transformer (remainder="drop"). If you want to keep unlisted columns
untouched and pass them straight through, use
ColumnTransformer([...], remainder="passthrough"). Being explicit here
prevents the silent surprise of a column vanishing — list every column you
intend to use.
make_column_transformer for brevity
Just as make_pipeline auto-names pipeline steps, make_column_transformer
builds a ColumnTransformer from bare (transformer, columns) pairs and
names each block for you. Use whichever reads more clearly to you; they
produce equivalent objects.
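A brief sketch of the shorthand; the column names follow the earlier example and the variable names are illustrative.

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Bare (transformer, columns) pairs; block names are generated for you.
preprocess = make_column_transformer(
    (StandardScaler(), ["age", "income"]),
    (OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
)

block_names = [name for name, _, _ in preprocess.transformers]
print(block_names)  # ['standardscaler', 'onehotencoder']
```

One consequence of the generated names: output columns are prefixed standardscaler__ and onehotencoder__ rather than the shorter num__ and cat__ you would pick yourself.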
The whole recipe travels as one object
Because the pipeline is an estimator, the complete preprocessing-plus-model
recipe moves around as a single thing. You can pass it to a train/test
workflow, to cross_val_score, or (with a small grid) to GridSearchCV for
tuning — all covered on their own pages — and the preprocessing rides along,
re-fit correctly in each context. You predict on brand-new raw data by
calling .predict on the same object, and it applies the identical scaling
and encoding it learned at training time.
The new batch included Tokyo, a city the encoder never saw at training
time. Because the encoder was created with handle_unknown="ignore", the
pipeline handled it gracefully instead of crashing — the exact robustness the
encoding page described, now happening automatically inside the pipeline.
When NOT to reach for a pipeline (and misconceptions)
Pipelines are a strong default, but a few honest notes:
- Pure exploration. When you are poking at data interactively to see what a transform does, calling a transformer directly is fine and often clearer. Reach for a pipeline once you are training and evaluating a model you will trust.
- Steps that are not fit-then-transform. A pipeline composes scikit-learn transformers (objects with fit/transform). One-off cleaning that does not learn anything from the data (dropping a column, parsing a date string) can happen before the pipeline; it does not leak because it learns nothing. (Engineering features with a fit step, however, belongs inside.)
- Misconception: a pipeline is just tidier code. Tidiness is a bonus; the real payoff is correctness. A pipeline makes leak-free preprocessing the default behavior, especially under cross-validation. That is a safety property, not a style preference.
- Misconception: pipelines change the model. They do not. A pipeline produces the same result as doing every step correctly by hand — it just makes "correctly" automatic. If a pipeline scores differently from your manual code, your manual code almost certainly had a leak.
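The claim that a pipeline matches correct manual code can be checked directly. A minimal sketch, assuming synthetic data from make_classification (the data and variable names are invented for this check):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# The fitted scaler's statistics come from the training rows only.
fitted_scaler = pipe.named_steps["scaler"]
stats_from_train = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))

# Redo the predict path by hand: transform (no re-fit), then predict.
manual_pred = pipe.named_steps["model"].predict(fitted_scaler.transform(X_test))
same_predictions = np.array_equal(manual_pred, pipe.predict(X_test))

print(stats_from_train, same_predictions)  # True True
```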
Common misconception: 'I'll just preprocess once up front'
Preprocessing the entire dataset once, before any splitting or cross-validation, is the very thing that leaks. Preprocessing must be re-fit on each training portion. A pipeline does this for you in every split; doing it by hand across many folds is so error-prone that the pipeline is the right answer in practice.
Real-world applications
Pipelines are the backbone of production scikit-learn:
- Tabular prediction at companies (churn, fraud, credit, demand) almost always ships as a single Pipeline of ColumnTransformer plus a model, so the same preprocessing that trained the model also runs at prediction time, with no drift between the two.
- Reproducibility and handoff. One object captures the entire recipe, so a teammate (or a future you) can load it, call .predict, and get exactly the training-time behavior, with no separate, undocumented preprocessing script to forget.
- Honest model selection. Because the whole pipeline cross-validates without leaking, comparisons between models and preprocessing choices are fair. That is the foundation the tuning page builds on.
The shape to remember: raw columns → ColumnTransformer (scale numeric, encode categorical) → model — all inside one Pipeline you fit once.
Your turn
A mixed DataFrame is provided as df with numeric columns
hours and prior_score, a categorical column subject, and a binary
target column passed. The features are already separated for you into
X (the three feature columns) and y (the target).
- Build a ColumnTransformer named preprocess with two blocks:
  - a StandardScaler applied to numeric_cols (["hours", "prior_score"]),
  - a OneHotEncoder(handle_unknown="ignore", sparse_output=False) applied to categorical_cols (["subject"]).
- Build a Pipeline named pipe with two steps: ("prep", preprocess) then ("clf", LogisticRegression(max_iter=1000)).
- Fit pipe on X_train, y_train (the split is done for you).
- Store the pipeline's test accuracy (via .score) in test_acc.
The hidden tests check that pipe is a fitted Pipeline whose final step
is a LogisticRegression, that its preprocessing step is a
ColumnTransformer, that the transformer outputs the right number of columns
(2 scaled numeric + one column per distinct subject), and that test_acc is
a sensible accuracy between 0 and 1.
Check your understanding
What does a Pipeline chain together?
- Only several preprocessing transformers, with no model
- A sequence of transformers followed by a final estimator (a model), exposed as a single object with one .fit and one .predict
- Two separate models whose predictions are averaged
- Multiple datasets concatenated into one
Why does wrapping preprocessing and a model in a Pipeline prevent data leakage during cross-validation?
- It shuffles the data more thoroughly before each fold
- It scales the entire dataset once before any splitting
- It re-fits every preprocessing step inside each fold using only that fold's training portion, so the held-out rows never influence the preprocessing
- It removes the test set from cross-validation entirely
What problem does ColumnTransformer solve that a single Pipeline step cannot?
- It makes the model train faster
- It removes the need to split the data
- It applies different transformers to different columns, for example scaling numeric columns while one-hot encoding categorical columns in the same dataset
- It guarantees the model reaches 100% accuracy
When you call pipeline.predict(X_new) on a fitted pipeline containing a StandardScaler, what does the scaler do?
- It re-fits on X_new to compute fresh statistics
- It transforms X_new using the mean and standard deviation it learned during fit, without re-fitting
- It is skipped entirely at predict time
- It returns the raw X_new unchanged
In ColumnTransformer([...]), what happens by default to a column you did not assign to any transformer?
- It is passed through to the model unchanged
- It raises an error
- It is dropped, because the default is remainder="drop"
- It is automatically one-hot encoded
A teammate scales and one-hot encodes the entire dataset, then passes the transformed arrays to cross_val_score with a bare LogisticRegression. What is the consequence?
- The code will not run because cross_val_score requires a pipeline
- Nothing: preprocessing the whole dataset first is the recommended approach
- Each fold's validation rows already influenced the scaling and encoding, so the reported cross-validation score is optimistically biased
- The model will underfit and score near zero