Pipelines and ColumnTransformer

How to bundle preprocessing and a model into one object that cannot leak test data into its preprocessing. This is the single most important engineering habit in scikit-learn, and the one that makes cross-validation honest.

By now you have met two preprocessing steps — scaling numbers and encoding categories — and one ironclad rule: fit preprocessing on the training set only, then apply it to everything else. You have also seen how easy it is to break that rule by accident. A misplaced fit_transform, a pd.get_dummies called separately on train and test, a scaler quietly fit on all the data — each leaks the test set and inflates your score.

This page introduces the tools that make doing it right easier than doing it wrong: Pipeline, which chains preprocessing and a model into a single estimator, and ColumnTransformer, which applies different preprocessing to different columns. Together they turn a fragile, multi-step ritual into one object you fit once. This is not a convenience feature — it is how professional scikit-learn code stays correct.

The problem: preprocessing and modeling are easy to desynchronize

Without pipelines, a realistic workflow looks like a chain of manual steps you must perform in the right order, on the right data, every time:

1. Split into train and test.
2. Fit a scaler on the training features, transform train, transform test.
3. Fit an encoder on the training categories, transform train, transform test.
4. Glue the scaled numbers and encoded categories back together.
5. Fit the model on the training matrix.
6. Apply the transforms from steps 2–4 (transform only, never re-fit), in the same order, to any new data before predicting.

Every one of those steps is a chance to leak the test set (fit on the wrong data), to forget a transform, or to apply transforms in a different order at predict time than at training time. And it gets worse: when you later use cross-validation (its own page), the data is split into folds many times, and the preprocessing must be re-fit fresh inside every fold. Doing that by hand is so tedious that people skip it — and silently leak.

Manual preprocessing is where leaks live

The bug is rarely in the model. It is almost always in the preprocessing: something was fit on data it should not have seen, or the test set was transformed with statistics it should not have known. The more manual steps between your raw data and your model, the more places a leak can hide.

Pipeline: one estimator, one fit, no leaks

A Pipeline chains a sequence of steps — any number of transformers followed by a final estimator (a model) — into a single object that behaves exactly like a model. You build it from a list of (name, step) pairs:

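from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),                   # transformer step
    ("model",  LogisticRegression(max_iter=1000)),  # final estimator
])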

When you call .fit(X_train, y_train) on this pipeline, it does the right thing automatically: it fits the scaler on X_train, transforms X_train, and feeds the result to the model's fit. When you call .predict(X_test), it transforms X_test with the already-fitted scaler (no re-fitting) and passes that to the model's predict. The fit-on-train / transform-both rule is enforced by construction — you cannot accidentally fit a step on the test set, because you never touch the steps individually.

Notice the asymmetry between those two calls: during training the scaler both fits and transforms; during prediction it only transforms, reusing what it learned. That is precisely the discipline you would otherwise have to remember by hand, and the pipeline never forgets it.

Here is the simplest possible pipeline, end to end:
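A minimal sketch of such a pipeline follows. The choice of scikit-learn's bundled breast cancer dataset and the split settings are assumptions made for illustration; any numeric feature matrix would do.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 30 numeric features, binary target.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# One fit: the scaler is fit on X_train only, then the model is trained
# on the scaled training data.
pipe.fit(X_train, y_train)

# One score: X_test is transformed with the already-fitted scaler.
test_acc = pipe.score(X_test, y_test)
print(f"test accuracy: {test_acc:.3f}")
```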

The scaling happens inside the pipeline, fit on the training data only, and applied consistently to the test data — and you never wrote a single scaler.fit or scaler.transform call. The leak is impossible to commit here, and the code is shorter to boot.

A pipeline IS an estimator

Anywhere scikit-learn expects a model — train_test_split workflows, cross_val_score, GridSearchCV — you can hand it a whole Pipeline. It exposes the same .fit, .predict, and .score methods. From the outside it is a model; on the inside it is your entire preprocessing-plus-model recipe. This is the key that makes leak-free cross-validation effortless, as the next section explains.
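As one illustration of handing a whole pipeline to a tool that expects a model, the sketch below passes a Pipeline to GridSearchCV. The synthetic data and the candidate values for C are assumptions chosen only to keep the example small; tuning itself is covered on its own page.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# A step's parameter is addressed as "<step name>__<parameter name>".
grid = GridSearchCV(pipe, param_grid={"model__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)

best_C = grid.best_params_["model__C"]
best_pipe = grid.best_estimator_  # a fitted Pipeline, refit on all of X, y
print(best_C, type(best_pipe).__name__)
```

The search treats the pipeline as a single estimator, so the scaler is re-fit along with the model for every candidate and every fold.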

make_pipeline: the same thing, names for free

If you do not care to name the steps, make_pipeline builds the identical object and automatically names each step after its lowercased class name:
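A short sketch of the shorthand, using the same two steps as before:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The generated step names are the lowercased class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']
```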

Why the pipeline prevents leakage in cross-validation

This is the reason pipelines matter most, so it deserves its own moment. Cross-validation (covered fully on its own page) repeatedly splits the data into a training portion and a held-out portion, scoring on each. If your scaler was fit once on the whole dataset before cross-validation, then in every fold the "held-out" portion was already seen by the scaler — a leak, repeated in every fold, quietly inflating the average.

When you cross-validate a pipeline, scikit-learn re-fits the entire pipeline — scaler, encoder, and all — separately inside each fold, using only that fold's training portion. The held-out portion of each fold is never seen by any preprocessing step until it is scored. The leak becomes structurally impossible.

Hand cross_val_score a bare model with pre-scaled data and you leak; hand it a pipeline and you do not. Same number of lines, opposite correctness. This is the single strongest argument for building pipelines as a default habit rather than an afterthought.
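The two variants can be sketched side by side. The dataset (scikit-learn's bundled breast cancer data) and the choice of 5 folds are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler sees every row, including each fold's held-out rows.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled, y, cv=5
)

# Leak-free: the scaler is re-fit inside each fold on that fold's training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print(f"pre-scaled data: {leaky_scores.mean():.3f}")
print(f"pipeline:        {clean_scores.mean():.3f}")
```

With plain standard scaling on a dataset of this size the two averages are often close. The gap tends to grow with small datasets and with preprocessing that learns more from the data, so the size of the difference here should not be read as the size of the risk.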

Cross-validating without a pipeline silently leaks

If you scale or encode the full dataset and then run cross-validation on the transformed data, every fold's validation rows already influenced the preprocessing, so the reported score is optimistic. Wrapping preprocessing and model in a Pipeline and cross-validating that single object is the standard, leak-proof approach. It is also why the ideas on this page logically come before the cross-validation page, whichever order you read them in.

ColumnTransformer: different transforms for different columns

Real datasets are not all numeric or all categorical — they are mixed. You want to StandardScaler the numeric columns and OneHotEncoder the categorical ones, in the same dataset, at the same time. A single Pipeline step applies one transform to all the columns it receives, which is not what you want here.

ColumnTransformer solves this. You give it a list of (name, transformer, columns) triples, and it routes each group of columns to its own transformer, then concatenates the results side by side.

Let us build a realistic mixed dataset inline and assemble the full thing: a ColumnTransformer for preprocessing feeding a LogisticRegression, all wrapped in one Pipeline.

One pipe.fit(X_train, y_train) call: it scaled age and income, one-hot encoded city and plan, glued the columns together, and trained the classifier — each preprocessing step fit on the training data only. You handed it a raw mixed DataFrame and got a trained model, with leakage ruled out by construction.

Seeing the columns the model actually receives

It is worth peeking at what comes out of the ColumnTransformer, because the model never sees your original columns — it sees the transformed ones.

The two numeric columns stay as two columns (now scaled), while city and plan expand into one 0/1 column per category. The num__ and cat__ prefixes tell you which transformer produced each column.

remainder: what happens to columns you did not list

By default, ColumnTransformer drops any column you did not assign to a transformer (remainder="drop"). If you want to keep unlisted columns untouched and pass them straight through, use ColumnTransformer([...], remainder="passthrough"). Being explicit here prevents the silent surprise of a column vanishing — list every column you intend to use.
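The difference between the two settings can be sketched with a DataFrame that has one column left out of the transformer list. The column is_member is hypothetical and exists only for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    "age":       [23, 45, 31, 52],
    "income":    [32, 85, 54, 98],
    "is_member": [0, 1, 1, 0],  # hypothetical column, not assigned to any transformer
})

# Default: unlisted columns are dropped.
dropped = ColumnTransformer([("num", StandardScaler(), ["age", "income"])])
dropped_out = dropped.fit_transform(X)

# Explicit passthrough: unlisted columns are appended unchanged.
kept = ColumnTransformer(
    [("num", StandardScaler(), ["age", "income"])],
    remainder="passthrough",
)
kept_out = kept.fit_transform(X)

print(dropped_out.shape)  # (4, 2)
print(kept_out.shape)     # (4, 3)
```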

make_column_transformer for brevity

Just as make_pipeline auto-names pipeline steps, make_column_transformer builds a ColumnTransformer from bare (transformer, columns) pairs and names each block for you. Use whichever reads more clearly to you; they produce equivalent objects.
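A brief sketch of the shorthand, assuming the same column names as the running example:

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = make_column_transformer(
    (StandardScaler(), ["age", "income"]),
    (OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
)

# Block names are generated from the lowercased class names.
block_names = [name for name, _, _ in preprocess.transformers]
print(block_names)  # ['standardscaler', 'onehotencoder']
```

One consequence worth knowing: the output column prefixes then become standardscaler__ and onehotencoder__ rather than num__ and cat__.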

The whole recipe travels as one object

Because the pipeline is an estimator, the complete preprocessing-plus-model recipe moves around as a single thing. You can pass it to a train/test workflow, to cross_val_score, or (with a small grid) to GridSearchCV for tuning — all covered on their own pages — and the preprocessing rides along, re-fit correctly in each context. You predict on brand-new raw data by calling .predict on the same object, and it applies the identical scaling and encoding it learned at training time.

The new batch included Tokyo, a city the encoder never saw at training time. Because the encoder was created with handle_unknown="ignore", the pipeline handled it gracefully instead of crashing — the exact robustness the encoding page described, now happening automatically inside the pipeline.

When NOT to reach for a pipeline (and misconceptions)

Pipelines are a strong default, but a few honest notes:

Pure exploration. When you are poking at data interactively to see what a transform does, calling a transformer directly is fine and often clearer. Reach for a pipeline once you are training and evaluating a model you will trust.
Steps that are not fit-then-transform. A pipeline composes scikit-learn transformers (objects with fit/transform). One-off cleaning that does not learn anything from the data, such as dropping a column or parsing a date string, can happen before the pipeline; it does not leak because it learns nothing. (Feature engineering that does have a fit step, however, belongs inside the pipeline.)
Misconception: a pipeline is just tidier code. Tidiness is a bonus; the real payoff is correctness. A pipeline makes leak-free preprocessing the default behavior, especially under cross-validation. That is a safety property, not a style preference.
Misconception: pipelines change the model. They do not. A pipeline produces the same result as doing every step correctly by hand — it just makes "correctly" automatic. If a pipeline scores differently from your manual code, your manual code almost certainly had a leak.
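The claim that a pipeline matches correct manual code can be checked directly. The sketch below uses synthetic data (an assumption for illustration) and compares the two approaches on the same split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual, done correctly: the scaler is fit on the training rows only.
scaler = StandardScaler().fit(X_train)
manual = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
manual_pred = manual.predict(scaler.transform(X_test))

# The same recipe as a pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_pred = pipe.fit(X_train, y_train).predict(X_test)

same_predictions = np.array_equal(manual_pred, pipe_pred)
same_coefficients = np.allclose(
    manual.coef_, pipe.named_steps["logisticregression"].coef_
)
print(same_predictions, same_coefficients)
```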

Common misconception: 'I'll just preprocess once up front'

Preprocessing the entire dataset once, before any splitting or cross-validation, is the very thing that leaks. Preprocessing must be re-fit on each training portion. A pipeline does this for you in every split; doing it by hand across many folds is so error-prone that the pipeline is the right answer in practice.

Real-world applications

Pipelines are the backbone of production scikit-learn:

Tabular prediction at companies — churn, fraud, credit, demand — almost always ships as a single Pipeline of ColumnTransformer plus a model, so the same preprocessing that trained the model also runs at prediction time, with no drift between the two.
Reproducibility and handoff. One object captures the entire recipe, so a teammate (or a future you) can load it, call .predict, and get exactly the training-time behavior — no separate, undocumented preprocessing script to forget.
Honest model selection. Because the whole pipeline cross-validates without leaking, comparisons between models and preprocessing choices are fair. That is the foundation the tuning page builds on.
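The handoff described above can be sketched with joblib, which is one common way to save a fitted scikit-learn object. The page itself does not name a persistence tool, so treat joblib, the file name, and the synthetic data as assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)      # saves preprocessing and model together
    loaded = joblib.load(path)   # what a teammate would run later

# The loaded object applies the same scaling and the same model.
same_preds = np.array_equal(pipe.predict(X), loaded.predict(X))
print(same_preds)
```

A saved pipeline should generally be loaded with the same scikit-learn version that produced it, and only from sources you trust, since loading executes pickled code.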

The shape to remember: raw columns → ColumnTransformer (scale numeric, encode categorical) → model — all inside one Pipeline you fit once.

Your turn

A mixed DataFrame is provided as df with numeric columns hours and prior_score, a categorical column subject, and a binary target column passed. The features are already separated for you into X (the three feature columns) and y (the target).

Build a ColumnTransformer named preprocess with two blocks:

a StandardScaler applied to numeric_cols (["hours", "prior_score"]),
a OneHotEncoder(handle_unknown="ignore", sparse_output=False) applied to categorical_cols (["subject"]).

Build a Pipeline named pipe with two steps: ("prep", preprocess) then ("clf", LogisticRegression(max_iter=1000)).
Fit pipe on X_train, y_train (the split is done for you).
Store the pipeline's test accuracy (via .score) in test_acc.

The hidden tests check that pipe is a fitted Pipeline whose final step is a LogisticRegression, that its preprocessing step is a ColumnTransformer, that the transformer outputs the right number of columns (2 scaled numeric + one column per distinct subject), and that test_acc is a sensible accuracy between 0 and 1.

Check your understanding

Question (select one)

What does a Pipeline chain together?

Only several preprocessing transformers, with no model

A sequence of transformers followed by a final estimator (a model), exposed as a single object with one .fit and one .predict

Two separate models whose predictions are averaged

Multiple datasets concatenated into one

Question (select one)

Why does wrapping preprocessing and a model in a Pipeline prevent data leakage during cross-validation?

It shuffles the data more thoroughly before each fold

It scales the entire dataset once before any splitting

It re-fits every preprocessing step inside each fold using only that fold's training portion, so the held-out rows never influence the preprocessing

It removes the test set from cross-validation entirely

Question (select one)

What problem does ColumnTransformer solve that a single Pipeline step cannot?

It makes the model train faster

It removes the need to split the data

It applies different transformers to different columns — for example, scaling numeric columns while one-hot encoding categorical columns in the same dataset

It guarantees the model reaches 100% accuracy

Question (select one)

When you call pipeline.predict(X_new) on a fitted pipeline containing a StandardScaler, what does the scaler do?

It re-fits on X_new to compute fresh statistics

It transforms X_new using the mean and standard deviation it learned during fit, without re-fitting

It is skipped entirely at predict time

It returns the raw X_new unchanged

Question (select one)

In ColumnTransformer([...]), what happens by default to a column you did not assign to any transformer?

It is passed through to the model unchanged

It raises an error

It is dropped, because the default is remainder="drop"

It is automatically one-hot encoded

Question (select one)

A teammate scales and one-hot encodes the entire dataset, then passes the transformed arrays to cross_val_score with a bare LogisticRegression. What is the consequence?

The code will not run because cross_val_score requires a pipeline

Nothing — preprocessing the whole dataset first is the recommended approach

Each fold's validation rows already influenced the scaling and encoding, so the reported cross-validation score is optimistically biased

The model will underfit and score near zero
