Encoding Categorical Features
Models do arithmetic, but categories like "red" and "Tokyo" are words. How to turn them into numbers honestly — and the encoding mistake that quietly teaches your model something false.
A scikit-learn model multiplies, adds, and compares numbers. It has no idea
what "red" means, or that "Tokyo" is a city. Yet real datasets are full
of exactly these: color, country, product category, blood type, payment
method. Before any model can use them, these categorical features must
become numbers.
The obvious way to do that is also, for most categories, the wrong way —
and the mistake is subtle enough that it ships in real systems all the
time. This page is about encoding categories correctly, why the tempting
shortcut backfires, and how OneHotEncoder does it right.
The problem: models need numbers, categories are words
Consider a tiny dataset describing T-shirts:
| color | size | city | price |
|---|---|---|---|
| red | M | Tokyo | 20 |
| green | L | Paris | 25 |
| blue | S | Tokyo | 18 |
The price column is already numeric — a model can use it directly. But
color, size, and city are text. If you hand this table straight to
LogisticRegression or KNeighborsClassifier, you get an error, because
there is no sensible way to compute "red times a weight."
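A minimal sketch of that failure, assuming the T-shirt table above is loaded into a pandas DataFrame. The `sold_out` target is hypothetical and exists only so that `fit` has something to predict; the exact wording of the error may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The T-shirt table from above, text columns left as text.
shirts = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["M", "L", "S"],
    "city": ["Tokyo", "Paris", "Tokyo"],
    "price": [20, 25, 18],
})
# Hypothetical target, invented purely for this illustration.
sold_out = [0, 1, 0]

error_message = None
try:
    LogisticRegression().fit(shirts, sold_out)
except (ValueError, TypeError) as exc:
    # scikit-learn tries to convert every column to float and fails on the text.
    error_message = f"{type(exc).__name__}: {exc}"

print(error_message)
```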
So we must encode: replace each category with numbers. The question is which numbers, and here is where intuition matters more than syntax.
The tempting trap: integer labels for nominal categories
The first idea everyone has is to number the categories: red = 0, green = 1, blue = 2. It is compact and easy. It is also, for most categorical features, a genuine modeling error.
Watch what it implies:
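A small sketch in plain Python, using a hand-written mapping (the dictionary names are illustrative, not part of any library), makes the implied relationships visible:

```python
# An arbitrary integer labeling of three colors.
label_of = {"red": 0, "green": 1, "blue": 2}
name_of = {label: name for name, label in label_of.items()}

# 1. An ordering appears: blue compares as "greater than" red.
blue_gt_red = label_of["blue"] > label_of["red"]

# 2. Distances appear: blue is twice as far from red as green is.
d_red_green = abs(label_of["green"] - label_of["red"])
d_red_blue = abs(label_of["blue"] - label_of["red"])

# 3. Arithmetic appears: the mean of red and blue decodes to green.
mean_label = (label_of["red"] + label_of["blue"]) / 2

print(blue_gt_red)               # True
print(d_red_green, d_red_blue)   # 1 2
print(name_of[int(mean_label)])  # green
```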
By assigning red = 0, green = 1, blue = 2, you have told the model three things that are simply not true:
- An ordering exists: blue (2) is "greater than" red (0).
- Distances are meaningful: green (1) is exactly halfway between red and blue, and blue is "twice as far" from red as green is.
- Arithmetic is sensible: the average of red and blue is green.
A linear model will dutifully fit a single weight to that column and assume that moving from red to green to blue is a steady, ordered progression. A distance-based model like KNN will think red and green are "close" while red and blue are "far," purely because of the integers you happened to assign. None of that reflects reality. Colors have no order.
The core misconception: integer labels invent a fake order
Encoding an unordered (nominal) category as a single integer column (red=0, green=1, blue=2) makes the model believe in an ordering and in distances that do not exist. The model will treat "blue > green > red" as meaningful and may fit a smooth trend across categories that are not on any scale at all. The bug does not crash — it silently degrades the model and can hide for a long time.
When integer labels are fine: genuinely ordinal data
The shortcut is not always wrong. It is wrong for nominal categories — ones with no inherent order (color, city, payment method). It is perfectly reasonable for ordinal categories — ones with a real, meaningful order.
T-shirt size is ordinal: small < medium < large is a true ordering, and the "distance" from S to L really is larger than from S to M. Encoding it as S = 0, M = 1, L = 2 preserves real information the model can use.
The litmus test is one question: does a meaningful order exist among the categories?
- Yes → ordinal → integer encoding is appropriate (scikit-learn's OrdinalEncoder automates it, and you control the order).
- No → nominal → integer encoding lies; use one-hot encoding instead.
Ask: would sorting these categories mean anything?
Sorting sizes (S, M, L) into an order is meaningful. Sorting colors (red, green, blue) into an order is arbitrary — any order is as good as any other, which is precisely the signal that an integer encoding would invent information. Ordinal data has a natural sort; nominal data does not.
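For the ordinal case, a short sketch of OrdinalEncoder with the order spelled out explicitly through its `categories` argument. The sample rows are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative rows; only the size column is encoded here.
sizes = pd.DataFrame({"size": ["M", "L", "S", "M"]})

# One list per column, written in the real-world order: S < M < L.
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())  # [1. 2. 0. 1.]
```

Passing the order yourself matters: left to its default, the encoder sorts the values it sees (L, M, S for these strings), which would not match the real size order.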
The fix for nominal categories: one-hot encoding
If colors have no order, we must encode them in a way that treats them as
equally different, unordered alternatives. One-hot encoding does
exactly that: it replaces the single color column with one new column
per category, each holding a 0 or a 1. A row's color is marked by a 1
in its column and 0 everywhere else.
So a single column of ["red", "green", "blue"] becomes three columns:
| color | color_red | color_green | color_blue |
|---|---|---|---|
| red | 1 | 0 | 0 |
| green | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |
Now no category is "greater than" another. Each is its own independent yes/no feature, and the model is free to learn a separate weight for each color with no false ordering imposed. This is why one-hot encoding is the default, safe choice for nominal categories.
OneHotEncoder in scikit-learn
scikit-learn's OneHotEncoder does this, and unlike a one-off pandas
trick, it remembers the categories it saw during fit so it can apply
the exact same columns at predict time. That memory matters — we will use
it in a moment.
Three text columns became several 0/1 columns — one per distinct value
across all three features (3 colors + 3 sizes + 3 cities = 9 columns here).
get_feature_names_out() tells you exactly which column is which, so you
never lose track of what color_blue refers to.
It is sparse_output, not sparse
In current scikit-learn the argument is sparse_output (the older sparse
name was removed). By default OneHotEncoder returns a memory-efficient
sparse matrix; passing sparse_output=False gives a normal dense array,
which is easier to inspect and print while learning. For large datasets with
many categories the sparse default saves a great deal of memory.
Seeing the trap cost real accuracy
The integer-label mistake is not just theoretically untidy — it measurably hurts. Let us prove it the way the preprocessing page proved scaling matters: train the same model two ways on the same data, once with integer labels and once with one-hot encoding, and compare.
We will build a dataset where a city's risk zig-zags under the obvious
alphabetical labeling: cities Alpha and Charlie are high-risk, while Bravo and
Delta are low-risk. The alphabetical ranking cannot line those up on a smooth
scale, and the only rankings that could (ones that group Alpha with Charlie at
one end) require already knowing the answer. That is exactly the situation
integer labels mishandle.
The integer-label model is stuck near coin-flip accuracy, while the one-hot model recovers the real per-city risk almost perfectly. The reason is exactly the false-order problem: a single weight on the integer column can only express a monotonic trend (risk going steadily up or down as the label increases), but the true pattern zig-zags across the labels (high, low, high, low), and nothing about an arbitrary integer assignment straightens it out. One-hot encoding gives each city its own weight, so the model fits each city's risk independently. The encoding choice, not the algorithm, decided whether the model could learn.
Why the gap can be dramatic
With integer labels, a linear model is forced to draw one line through "label 0, label 1, label 2, label 3." If the target does not rise or fall steadily with the label, that line fits poorly no matter how it is angled. One-hot encoding removes the constraint entirely by giving every category its own dial. The more non-monotonic the true relationship, the larger the penalty for integer-labeling a nominal feature.
handle_unknown="ignore": surviving unseen categories
Here is a problem one-hot encoding must solve. You fit the encoder on
your training data, which contains the cities Tokyo, Paris, and
Berlin. Months later, a new record arrives with city London — a value
the encoder never saw. What should happen?
By default the encoder raises an error, because it does not have a
city_London column to put a 1 in. That is often not what you want in
production, where new categories are a fact of life. Setting
handle_unknown="ignore" makes the encoder handle the unknown gracefully:
it sets all of that feature's one-hot columns to 0 for the unknown
value, effectively saying "this record's city is none of the ones I know."
Paris gets its usual one-hot row. London, never seen during fit,
becomes all zeros — the encoder does not invent a column or crash; it
simply records "unknown." Crucially, the number of output columns stays
fixed at what the encoder learned during fit, so the shape your model
expects never changes from one batch to the next.
Why fixed output columns matter
A trained model expects an exact number of input columns, in an exact
order. Because OneHotEncoder locks in its columns at fit time, every
later transform produces that same set of columns. With
handle_unknown="ignore" that holds even when new categories show up; without
it, an unseen category stops the transform with an error. This is also why you must fit the
encoder on training data only and reuse it, exactly like the scaler on the
preprocessing page: the columns are learned, and that learning must come
from the training set.
Encode using the encoder you fit on training data
Do not call pd.get_dummies separately on your train and test sets and
hope the columns line up — they will not if a category is missing from one
split, and your model will receive mismatched columns. Fit a single
OneHotEncoder on the training data and use it to transform everything.
(The pipelines page shows how to bundle this with your model so it happens
automatically and leak-free.)
The dummy-variable detail (and why scikit-learn does not force it)
You may have heard, from a statistics course, that for k categories you
should create only k - 1 columns, dropping one as a "reference," to avoid
perfect collinearity (the dropped column is implied when all the others are
0). OneHotEncoder supports this via drop="first", but does not do it
by default, and for most machine learning models that is fine:
- Regularized models (the default LogisticRegression) handle the redundant column without trouble; the penalty resolves the collinearity.
- Tree-based models are unaffected by collinearity entirely.
For classical inferential linear regression where you interpret coefficients, dropping a column keeps them identifiable. For prediction with regularized or tree models — the focus of this course — keeping all columns is a perfectly normal default. Know that the option exists; do not agonize over it.
When NOT to one-hot encode
One-hot encoding is the right default for nominal categories, but it is not free, and it is not always the best tool:
- High-cardinality features. A zip_code or user_id column with thousands of distinct values would explode into thousands of mostly-zero columns: wide, sparse, and slow. For very high cardinality, other techniques (target encoding, hashing, embeddings) are usually better; one-hot becomes unwieldy.
- Genuinely ordinal features. As covered above, if a real order exists (size, education level, satisfaction rating), an ordinal integer encoding preserves that order and is often the better, more compact choice.
- Already-numeric features. Do not one-hot encode price or age. They are numbers with meaningful magnitude and order; encoding them as categories would throw away exactly the information you want.
- Free text. A column of full sentences or product reviews is not a category with a handful of values; it needs text-specific processing, not one-hot encoding of every unique string.
Common misconception: 'more columns means more information'
One-hot encoding does not add information; it re-represents the same information in a form the model can use without inventing a false order. With very high cardinality, all those extra columns can actually hurt — they add sparsity and dimensions without adding signal, making models slower and more prone to overfitting. Use one-hot when the number of distinct categories is modest.
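To get a feel for the blow-up, the sketch below one-hot encodes 5,000 made-up user IDs and, for contrast, hashes the same values into a fixed 32 columns with scikit-learn's FeatureHasher. The hasher stands in for the "hashing" alternative mentioned above; the width of 32 is an arbitrary choice for illustration, not a recommendation.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: every row has its own ID.
user_ids = pd.DataFrame({"user_id": [f"user_{i}" for i in range(5000)]})

# One-hot: one column per distinct value.
one_hot = OneHotEncoder().fit_transform(user_ids)
print(one_hot.shape)  # (5000, 5000)

# Hashing: the column count is fixed up front, whatever the cardinality.
hasher = FeatureHasher(n_features=32, input_type="string")
hashed = hasher.transform([[uid] for uid in user_ids["user_id"]])
print(hashed.shape)  # (5000, 32)
```

The trade-off is that hashing lets distinct IDs collide in the same column, so it gives up the one-column-per-category readability that one-hot encoding provides.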
Real-world applications
Categorical encoding is everywhere tabular data is, because real-world records are full of labels:
- Customer churn. Features like subscription plan, country, and device type are all nominal categories one-hot encoded before a model can use them.
- Medical prediction. Blood type, symptom presence, and treatment category are categorical; blood type is nominal (one-hot), while a pain scale of mild/moderate/severe is ordinal (integer encoding preserves the order).
- E-commerce. Product category, payment method, and shipping region are classic nominal features. Mishandling them — for instance, integer-labeling product categories — is a common, quiet source of weaker models.
The reasoning is always the same: numbers in, but honest numbers that do not assert relationships the world does not contain.
Your turn
A small DataFrame df with two categorical columns,
fruit and country, is provided.
- Create a OneHotEncoder with handle_unknown="ignore" and sparse_output=False and store it in encoder.
- Fit-transform df and store the resulting array in encoded.
- Store the encoder's output column names (from get_feature_names_out()) in columns (as a list).
The hidden tests check that encoded has one row per original row, that
the total number of one-hot columns equals the number of distinct categories
across both features, that the encoded values are only 0s and 1s, and that
each row sums to exactly 2 (one '1' for fruit, one '1' for country).
Check your understanding
Why is encoding the nominal feature color as red=0, green=1, blue=2 a modeling mistake?
It uses too much memory compared to one-hot encoding
It invents a false ordering and false distances — the model treats blue (2) as "greater than" red (0) and green as the midpoint, relationships that do not exist among colors
It will raise an error when the model is trained
It removes the color information entirely
For which feature is a single integer encoding (0, 1, 2, ...) genuinely appropriate?
City: New York=0, London=1, Tokyo=2
Color: red=0, green=1, blue=2
T-shirt size: S=0, M=1, L=2
Payment method: cash=0, card=1, crypto=2
What does one-hot encoding produce from a single column of three distinct categories?
A single integer column with values 0, 1, 2
Three new columns, each 0 or 1, where each row has a single 1 marking its category
A single column scaled to the range [0, 1]
One column containing the category name as text
A OneHotEncoder was fit with handle_unknown="ignore". At predict time it meets a category it never saw during fit. What happens?
It raises an error because there is no column for the new category
It adds a brand-new column for the unseen category
It sets all of that feature's one-hot columns to 0 for the unknown value, keeping the output shape fixed
It replaces the unknown value with the most common category from training
When is one-hot encoding a poor choice?
When the categorical feature has only three or four distinct values
When the model is a regularized logistic regression
When the feature has very high cardinality, such as thousands of distinct zip codes or user IDs
When the feature is nominal (unordered)