Encoding Categorical Features

Models do arithmetic, but categories like "red" and "Tokyo" are words. How to turn them into numbers honestly — and the encoding mistake that quietly teaches your model something false.

A scikit-learn model multiplies, adds, and compares numbers. It has no idea what "red" means, or that "Tokyo" is a city. Yet real datasets are full of exactly these: color, country, product category, blood type, payment method. Before any model can use them, these categorical features must become numbers.

The obvious way to do that is also, for most categories, the wrong way — and the mistake is subtle enough that it ships in real systems all the time. This page is about encoding categories correctly, why the tempting shortcut backfires, and how OneHotEncoder does it right.

The problem: models need numbers, categories are words

Consider a tiny dataset describing T-shirts:

color	size	city	price
red	M	Tokyo	20
green	L	Paris	25
blue	S	Tokyo	18

The price column is already numeric — a model can use it directly. But color, size, and city are text. If you hand this table straight to LogisticRegression or KNeighborsClassifier, you get an error, because there is no sensible way to compute "red times a weight."
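A small sketch of that failure, assuming pandas and scikit-learn are installed. The target labels `y` are made up purely so that `fit` has something to predict; they are not part of the T-shirt table.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The T-shirt table from above.
X = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["M", "L", "S"],
    "city": ["Tokyo", "Paris", "Tokyo"],
    "price": [20, 25, 18],
})
y = [1, 0, 1]  # hypothetical labels, invented for this illustration

try:
    LogisticRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn cannot convert strings such as "red" to floats
    fit_failed = True
    print("fit failed:", err)
```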

So we must encode: replace each category with numbers. The question is which numbers, and here is where intuition matters more than syntax.

The tempting trap: integer labels for nominal categories

The first idea everyone has is to number the categories: red = 0, green = 1, blue = 2. It is compact and easy. It is also, for most categorical features, a genuine modeling error.

Watch what it implies:

By assigning red = 0, green = 1, blue = 2, you have told the model three things that are simply not true:

An ordering exists: blue (2) is "greater than" red (0).
Distances are meaningful: green (1) is exactly halfway between red and blue, and blue is "twice as far" from red as green is.
Arithmetic is sensible: the average of red and blue is green.

A linear model will dutifully fit a single weight to that column and assume that moving from red to green to blue is a steady, ordered progression. A distance-based model like KNN will think red and green are "close" while red and blue are "far," purely because of the integers you happened to assign. None of that reflects reality. Colors have no order.
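The three false claims can be checked with plain Python. This is only an illustration of the arithmetic the integers permit, using the same red = 0, green = 1, blue = 2 mapping as the text; the variable names are arbitrary.

```python
# Arbitrary integer labels for a nominal feature (the mapping used in the text).
label = {"red": 0, "green": 1, "blue": 2}
name = {code: color for color, code in label.items()}

# 1. A comparison the model can now make, although colors have no order.
blue_greater_than_red = label["blue"] > label["red"]

# 2. Distances that exist only because of the chosen integers.
dist_red_green = abs(label["red"] - label["green"])
dist_red_blue = abs(label["red"] - label["blue"])

# 3. Arithmetic that produces nonsense: the "average" of red and blue.
average_color = name[(label["red"] + label["blue"]) // 2]

print(blue_greater_than_red)          # True
print(dist_red_green, dist_red_blue)  # 1 2
print(average_color)                  # green
```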

The core misconception: integer labels invent a fake order

Encoding an unordered (nominal) category as a single integer column (red=0, green=1, blue=2) makes the model believe in an ordering and in distances that do not exist. The model will treat "blue > green > red" as meaningful and may fit a smooth trend across categories that are not on any scale at all. The bug does not crash — it silently degrades the model and can hide for a long time.

When integer labels are fine: genuinely ordinal data

The shortcut is not always wrong. It is wrong for nominal categories — ones with no inherent order (color, city, payment method). It is perfectly reasonable for ordinal categories — ones with a real, meaningful order.

T-shirt size is ordinal: small < medium < large is a true ordering, and the "distance" from S to L really is larger than from S to M. Encoding it as S = 0, M = 1, L = 2 preserves real information the model can use.

The litmus test is one question: does a meaningful order exist among the categories?

Yes → ordinal → integer encoding is appropriate (scikit-learn's OrdinalEncoder automates it, and you control the order).
No → nominal → integer encoding lies; use one-hot encoding instead.

Ask: would sorting these categories mean anything?

Sorting sizes (S, M, L) into an order is meaningful. Sorting colors (red, green, blue) into an order is arbitrary — any order is as good as any other, which is precisely the signal that an integer encoding would invent information. Ordinal data has a natural sort; nominal data does not.
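For the ordinal case, a minimal sketch with OrdinalEncoder might look like the following. The sample sizes are made up; the point is the explicit `categories` argument, which is how you control the order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["M", "L", "S", "M"]})

# Pass the order explicitly; otherwise categories are sorted alphabetically
# (L, M, S), which is not the real size order.
size_encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
size_codes = size_encoder.fit_transform(sizes)

print(size_codes.ravel())  # [1. 2. 0. 1.]
```

Leaving `categories` at its default would still run, but the integers would follow alphabetical order rather than the true S < M < L order.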

The fix for nominal categories: one-hot encoding

If colors have no order, we must encode them in a way that treats them as equally different, unordered alternatives. One-hot encoding does exactly that: it replaces the single color column with one new column per category, each holding a 0 or a 1. A row's color is marked by a 1 in its column and 0 everywhere else.

So a single column of ["red", "green", "blue"] becomes three columns:

color	color_red	color_green	color_blue
red	1	0	0
green	0	1	0
blue	0	0	1

Now no category is "greater than" another. Each is its own independent yes/no feature, and the model is free to learn a separate weight for each color with no false ordering imposed. This is why one-hot encoding is the default, safe choice for nominal categories.
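One way to see "equally different" concretely is to measure the distance between the one-hot vectors. This is a small NumPy sketch, not part of any scikit-learn API: every pair of colors ends up the same distance apart.

```python
import numpy as np

# Rows of the identity matrix are the one-hot vectors for red, green, blue.
red, green, blue = np.eye(3)

d_red_green = np.linalg.norm(red - green)
d_red_blue = np.linalg.norm(red - blue)
d_green_blue = np.linalg.norm(green - blue)

# All three distances equal sqrt(2): no pair is closer than any other.
print(d_red_green, d_red_blue, d_green_blue)
```

Compare this with the integer labels, where red and blue were twice as far apart as red and green.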

OneHotEncoder in scikit-learn

scikit-learn's OneHotEncoder does this, and unlike a one-off pandas trick, it remembers the categories it saw during fit so it can apply the exact same columns at predict time. That memory matters — we will use it in a moment.

Three text columns became several 0/1 columns — one per distinct value across all three features (3 colors + 3 sizes + 3 cities = 9 columns here). get_feature_names_out() tells you exactly which column is which, so you never lose track of what color_blue refers to.

It is sparse_output, not sparse

In scikit-learn 1.2 and later the argument is sparse_output (the older sparse name was deprecated in 1.2 and removed in 1.4). By default OneHotEncoder returns a memory-efficient sparse matrix; passing sparse_output=False gives a normal dense array, which is easier to inspect and print while learning. For large datasets with many categories the sparse default saves a great deal of memory.

Seeing the trap cost real accuracy

The integer-label mistake is not just theoretically untidy — it measurably hurts. Let us prove it the way the preprocessing page proved scaling matters: train the same model two ways on the same data, once with integer labels and once with one-hot encoding, and compare.

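We will build a dataset where a city's risk is non-monotonic under an arbitrary integer labeling: cities Alpha and Charlie are high-risk, while Bravo and Delta are low-risk. Alphabetical labels (Alpha = 0, Bravo = 1, Charlie = 2, Delta = 3) arrange those risks as high, low, high, low, so no steady trend across the labels fits them. Only a labeling that happened to group the high-risk cities together would line up, and choosing it would require knowing the answer in advance. This is exactly the situation integer labels mishandle.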
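One way such a comparison could be set up is sketched below. The synthetic data, sample size, and risk probabilities (0.95 and 0.05) are assumptions made for this illustration, so the exact accuracies depend on them; `holdout_accuracy` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (assumed for illustration): Alpha and Charlie are high-risk.
city = rng.choice(["Alpha", "Bravo", "Charlie", "Delta"], size=n)
p_high = np.where(np.isin(city, ["Alpha", "Charlie"]), 0.95, 0.05)
y = (rng.random(n) < p_high).astype(int)

X = pd.DataFrame({"city": city})
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def holdout_accuracy(encoder):
    """Fit the encoder on training data only, then score on the test split."""
    model = LogisticRegression()
    model.fit(encoder.fit_transform(X_train), y_train)
    return model.score(encoder.transform(X_test), y_test)

# OrdinalEncoder assigns Alpha=0, Bravo=1, Charlie=2, Delta=3 (sorted order).
acc_integer = holdout_accuracy(OrdinalEncoder())
acc_onehot = holdout_accuracy(OneHotEncoder(handle_unknown="ignore"))

print(f"integer labels: {acc_integer:.2f}")
print(f"one-hot:        {acc_onehot:.2f}")
```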

The integer-label model is stuck near coin-flip accuracy, while the one-hot model recovers the real per-city risk almost perfectly. The reason is exactly the false-order problem: a single weight on the integer column can only express a monotonic trend (risk going steadily up or down as the label increases), but under these labels the true pattern zig-zags (high, low, high, low), and no single slope can follow it. One-hot encoding gives each city its own weight, so the model fits each city's risk independently. The encoding choice, not the algorithm, decided whether the model could learn.

Why the gap can be dramatic

With integer labels, a linear model is forced to draw one line through "label 0, label 1, label 2, label 3." If the target does not rise or fall steadily with the label, that line fits poorly no matter how it is angled. One-hot encoding removes the constraint entirely by giving every category its own dial. The more non-monotonic the true relationship, the larger the penalty for integer-labeling a nominal feature.

handle_unknown="ignore": surviving unseen categories

Here is a problem one-hot encoding must solve. You fit the encoder on your training data, which contains the cities Tokyo, Paris, and Berlin. Months later, a new record arrives with city London — a value the encoder never saw. What should happen?

By default the encoder raises an error, because it does not have a city_London column to put a 1 in. That is often not what you want in production, where new categories are a fact of life. Setting handle_unknown="ignore" makes the encoder handle the unknown gracefully: it sets all of that feature's one-hot columns to 0 for the unknown value, effectively saying "this record's city is none of the ones I know."

Paris gets its usual one-hot row. London, never seen during fit, becomes all zeros — the encoder does not invent a column or crash; it simply records "unknown." Crucially, the number of output columns stays fixed at what the encoder learned during fit, so the shape your model expects never changes from one batch to the next.

Why fixed output columns matter

A trained model expects an exact number of input columns, in an exact order. Because OneHotEncoder locks in its columns at fit time, every later transform produces that same set of columns, even when new categories show up. Without handle_unknown="ignore", an unseen category stops the transform with an error instead of being mapped onto those fixed columns. This is also why you must fit the encoder on training data only and reuse it, exactly like the scaler on the preprocessing page: the columns are learned, and that learning must come from the training set.

Encode using the encoder you fit on training data

Do not call pd.get_dummies separately on your train and test sets and hope the columns line up — they will not if a category is missing from one split, and your model will receive mismatched columns. Fit a single OneHotEncoder on the training data and use it to transform everything. (The pipelines page shows how to bundle this with your model so it happens automatically and leak-free.)

The dummy-variable detail (and why scikit-learn does not force it)

You may have heard, from a statistics course, that for $k$ categories you should create only $k-1$ columns — dropping one as a "reference" — to avoid perfect collinearity (the dropped column is implied when all the others are 0). OneHotEncoder supports this via drop="first", but does not do it by default, and for most machine learning models that is fine:

Regularized models (the default LogisticRegression) handle the redundant column without trouble; the penalty resolves the collinearity.
Tree-based models are unaffected by collinearity entirely.

For classical inferential linear regression where you interpret coefficients, dropping a column keeps them identifiable. For prediction with regularized or tree models — the focus of this course — keeping all columns is a perfectly normal default. Know that the option exists; do not agonize over it.

When NOT to one-hot encode

One-hot encoding is the right default for nominal categories, but it is not free, and it is not always the best tool:

High-cardinality features. A zip_code or user_id column with thousands of distinct values would explode into thousands of mostly-zero columns — wide, sparse, and slow. For very high cardinality, other techniques (target encoding, hashing, embeddings) are usually better; one-hot becomes unwieldy.
Genuinely ordinal features. As covered above, if a real order exists (size, education level, satisfaction rating), an ordinal integer encoding preserves that order and is often the better, more compact choice.
Already-numeric features. Do not one-hot encode price or age. They are numbers with meaningful magnitude and order; encoding them as categories would throw away exactly the information you want.
Free text. A column of full sentences or product reviews is not a category with a handful of values; it needs text-specific processing, not one-hot encoding of every unique string.

Common misconception: 'more columns means more information'

One-hot encoding does not add information; it re-represents the same information in a form the model can use without inventing a false order. With very high cardinality, all those extra columns can actually hurt — they add sparsity and dimensions without adding signal, making models slower and more prone to overfitting. Use one-hot when the number of distinct categories is modest.
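To get a feel for the column explosion, here is a small sketch with a hypothetical user_id column of 5,000 distinct values. The column count equals the number of distinct values, and almost every cell is zero.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality feature: 5,000 distinct user IDs.
users = pd.DataFrame({"user_id": [f"user_{i}" for i in range(5000)]})

wide = OneHotEncoder().fit_transform(users)  # sparse by default

print(wide.shape)  # (5000, 5000)
print(wide.nnz)    # 5000 nonzero entries out of 25,000,000 cells
```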

Real-world applications

Categorical encoding is everywhere tabular data is, because real-world records are full of labels:

Customer churn. Features like subscription plan, country, and device type are all nominal categories one-hot encoded before a model can use them.
Medical prediction. Blood type, symptom presence, and treatment category are categorical; blood type is nominal (one-hot), while a pain scale of mild/moderate/severe is ordinal (integer encoding preserves the order).
E-commerce. Product category, payment method, and shipping region are classic nominal features. Mishandling them — for instance, integer-labeling product categories — is a common, quiet source of weaker models.

The reasoning is always the same: numbers in, but honest numbers that do not assert relationships the world does not contain.

Your turn

A small DataFrame df with two categorical columns, fruit and country, is provided.

Create a OneHotEncoder with handle_unknown="ignore" and sparse_output=False and store it in encoder.
Fit-transform df and store the resulting array in encoded.
Store the encoder's output column names (from get_feature_names_out()) in columns (as a list).

The hidden tests check that encoded has one row per original row, that the total number of one-hot columns equals the number of distinct categories across both features, that the encoded values are only 0s and 1s, and that each row sums to exactly 2 (one '1' for fruit, one '1' for country).

Check your understanding

Question (select one)

Why is encoding the nominal feature color as red=0, green=1, blue=2 a modeling mistake?

It uses too much memory compared to one-hot encoding

It invents a false ordering and false distances — the model treats blue (2) as "greater than" red (0) and green as the midpoint, relationships that do not exist among colors

It will raise an error when the model is trained

It removes the color information entirely

Question (select one)

For which feature is a single integer encoding (0, 1, 2, ...) genuinely appropriate?

City: New York=0, London=1, Tokyo=2

Color: red=0, green=1, blue=2

T-shirt size: S=0, M=1, L=2

Payment method: cash=0, card=1, crypto=2

Question (select one)

What does one-hot encoding produce from a single column of three distinct categories?

A single integer column with values 0, 1, 2

Three new columns, each 0 or 1, where each row has a single 1 marking its category

A single column scaled to the range [0, 1]

One column containing the category name as text

Question (select one)

A OneHotEncoder was fit with handle_unknown="ignore". At predict time it meets a category it never saw during fit. What happens?

It raises an error because there is no column for the new category

It adds a brand-new column for the unseen category

It sets all of that feature's one-hot columns to 0 for the unknown value, keeping the output shape fixed

It replaces the unknown value with the most common category from training

Question (select one)

When is one-hot encoding a poor choice?

When the categorical feature has only three or four distinct values

When the model is a regularized logistic regression

When the feature has very high cardinality, such as thousands of distinct zip codes or user IDs

When the feature is nominal (unordered)
