Features and Targets

Every supervised machine learning problem reduces to a feature matrix X and a target y. We define both precisely, walk through the types of features and targets, and practice splitting a real table into the two pieces every model expects.

You have now seen the same four-line shape several times: load data into X and y, split, fit, evaluate. It is time to slow down and ask what X and y actually are, because almost every modeling decision — what algorithm to use, what task you are even solving — flows directly from how your data is shaped into these two objects.

The good news is that the structure is simple and universal. Once you can look at any table and confidently say "these columns are the features, this column is the target," you can frame essentially any supervised problem. That framing is the skill this page builds.

The two pieces every model needs

A supervised machine learning model learns to map inputs to outputs. We give those two roles standard names and standard variable conventions that you will see in every scikit-learn example, every textbook, and every page of this course:

X — the feature matrix. A 2-D table of inputs. Each row is one example (also called a sample or observation); each column is one feature (also called a variable, attribute, or predictor). X is capitalized because it is a matrix (two dimensions).
y — the target. A 1-D list of answers, one per row of X. It is the thing you want the model to predict. y is lowercase because it is a vector (one dimension).

The single most important alignment rule: row i of X and entry i of y describe the same example. The features in the first row of X belong to the answer in the first slot of y. If that correspondence ever breaks, your model learns nonsense — so we always split X and y together, never separately.
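As a minimal sketch of splitting X and y together, the snippet below passes both objects to scikit-learn's train_test_split in a single call. The two feature columns and the values are made up for illustration; only the pattern matters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up illustrative data: six examples, two features, one answer each.
X = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feature_b": [10, 20, 30, 40, 50, 60],
})
y = pd.Series([0, 0, 1, 1, 0, 1], name="target")

# Passing X and y in the same call applies the same row shuffle to both,
# so row i of each piece still describes the same example.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0
)

# The row labels of the two training pieces match one for one.
print(X_train.index.tolist())
print(y_train.index.tolist())
```

Splitting X and y in two separate calls, each with its own shuffle, would break the correspondence between features and answers.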

Picture X and y side by side and read across, not down: each row of X lines up with one entry of y. Columns of X are features; the single column y is what we predict. Hold this picture in mind and the abstract letters become concrete.

Why these exact letters?

The convention comes from the math: a model is a function y = f(X), just like the y = f(x) you saw in algebra. X is capital because it is a whole matrix of inputs (many rows, many columns); y is lowercase because it is a single column of outputs. You do not have to use these names — Python does not care — but everyone does, so following the convention makes your code instantly readable to others.

A concrete example: from a table to X and y

Abstractions are slippery, so let us anchor them in a real little table. Imagine a handful of houses we want to price. Each house has a size, a number of bedrooms, an age, and a neighborhood; the thing we want to predict is its price.

We will build this table as a small pandas DataFrame and look at it.
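One way such a table might be built is sketched here. The column names bedrooms and price and every value are invented for illustration (size_sqft, age_years, and neighborhood follow the names used elsewhere on this page); they are not real housing data.

```python
import pandas as pd

# Six made-up houses; all values are illustrative only.
houses = pd.DataFrame({
    "size_sqft": [850, 1200, 1500, 2000, 2400, 3000],
    "bedrooms": [2, 3, 3, 4, 4, 5],
    "age_years": [30, 15, 20, 5, 10, 2],
    "neighborhood": ["downtown", "suburb", "suburb",
                     "rural", "downtown", "suburb"],
    "price": [210000, 285000, 320000, 390000, 520000, 610000],
})

print(houses)
print(houses.shape)  # (6, 5)
```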

Six houses, five columns. Now comes the central act of framing a supervised problem: deciding which columns are inputs (features) and which column is the output (target). Here we want to predict price from everything else, so price becomes y and the other four columns become X.
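The split itself can be sketched in two lines. The table is rebuilt here so the snippet runs on its own; its values and the column name bedrooms are made up for illustration.

```python
import pandas as pd

# Six made-up houses; all values are illustrative only.
houses = pd.DataFrame({
    "size_sqft": [850, 1200, 1500, 2000, 2400, 3000],
    "bedrooms": [2, 3, 3, 4, 4, 5],
    "age_years": [30, 15, 20, 5, 10, 2],
    "neighborhood": ["downtown", "suburb", "suburb",
                     "rural", "downtown", "suburb"],
    "price": [210000, 285000, 320000, 390000, 520000, 610000],
})

X = houses.drop(columns=["price"])  # every column except the target
y = houses["price"]                 # the single target column

print(X.shape)  # (6, 4)
print(y.shape)  # (6,)
```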

That is the entire idea. houses.drop(columns=["price"]) keeps every input column; houses["price"] pulls out the one answer column. The shapes confirm the structure: X is (6, 4), six examples with four features each, and y is (6,), six answers. X and y still have the same number of rows, six, which is what keeps row i aligned with answer i.

The most common framing bug

Never leave the target inside X. If price were still a column of X, the model could "predict" price by simply copying it — perfect training accuracy, zero real skill. Worse, you could not use the model at all, because at prediction time you would not have the price (predicting it is the whole point). Always remove the target from the feature matrix. This is one flavor of data leakage, a trap the preprocessing chapters return to.
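To see why this matters, here is a small sketch that deliberately leaves the target in the feature matrix. The numeric house data is made up, and LinearRegression is used only as a convenient example estimator; the neighborhood column is left out so that every feature is already numeric.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up numeric house data, for illustration only.
houses = pd.DataFrame({
    "size_sqft": [850, 1200, 1500, 2000, 2400, 3000],
    "bedrooms": [2, 3, 3, 4, 4, 5],
    "age_years": [30, 15, 20, 5, 10, 2],
    "price": [210000, 285000, 320000, 390000, 520000, 610000],
})
y = houses["price"]

X_leaky = houses                          # deliberate bug: price is still a feature
X_clean = houses.drop(columns=["price"])  # correct: target removed

# The leaky model can simply copy the price column back out.
leaky_score = LinearRegression().fit(X_leaky, y).score(X_leaky, y)
print(round(leaky_score, 3))

# A one-line guard worth adding after building X.
print("price" in X_clean.columns)  # False
```

The near-perfect training score of the leaky model says nothing about real skill, which is exactly what makes this bug easy to miss.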

Reading X.shape

X.shape is the first thing experienced practitioners check after building a feature matrix, because it answers two questions at once: how many examples do I have (rows) and how many features describe each one (columns).

Whether X is a pandas DataFrame or a plain NumPy array, scikit-learn reads it the same way: rows are samples, columns are features. The shape is always (n_samples, n_features). Internalize that ordering — rows first, features second — and you will rarely be confused about what an array of data represents.
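A quick sketch of that equivalence, using made-up values: the same data reports the same shape whether it is held as a DataFrame or as a NumPy array.

```python
import pandas as pd

# Made-up feature values, for illustration only.
X_df = pd.DataFrame({
    "size_sqft": [850, 1200, 1500, 2000, 2400, 3000],
    "bedrooms": [2, 3, 3, 4, 4, 5],
})
X_np = X_df.to_numpy()  # same data as a plain NumPy array

# Both report (n_samples, n_features): rows first, features second.
n_samples, n_features = X_np.shape
print(X_df.shape)  # (6, 2)
print(X_np.shape)  # (6, 2)
print(n_samples, n_features)  # 6 2
```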

A sanity check you should run every time

After building X and y, confirm X.shape[0] == len(y). If the number of rows in X does not equal the number of answers in y, something is misaligned and scikit-learn will (rightly) refuse to fit. Making this check a reflex catches a whole category of bugs before they start.
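One way to turn the check into a habit is a tiny helper. The name check_alignment is hypothetical, not part of pandas or scikit-learn, and the data is made up.

```python
import pandas as pd

def check_alignment(X, y):
    """Hypothetical helper: fail early if X and y disagree on row count."""
    n_rows, n_answers = X.shape[0], len(y)
    if n_rows != n_answers:
        raise ValueError(f"X has {n_rows} rows but y has {n_answers} entries")
    return n_rows

# Made-up data, for illustration only.
X = pd.DataFrame({"size_sqft": [850, 1200, 1500], "bedrooms": [2, 3, 3]})
y = pd.Series([210000, 285000, 320000], name="price")

print(check_alignment(X, y))  # 3
```

A matching row count is necessary but not sufficient: it catches missing or extra rows, not rows that were reordered in only one of the two objects.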

X is always 2-D, even with a single feature

A subtle but extremely common stumbling block: scikit-learn expects X to be two-dimensional — a matrix — even when you have only one feature. A flat 1-D list of values is not a valid feature matrix; it must be shaped as "one column," which is a 2-D array with a single column. The target y, by contrast, is 1-D. Mixing these up produces one of the most frequent beginner errors.

The rule of thumb is simple: X has two dimensions (samples by features), y has one (samples). When you have a single feature, reshape(-1, 1) turns a flat array into the required column shape — the -1 means "infer the number of rows," and the 1 means "one column." A DataFrame avoids this trap naturally, because selecting columns with double brackets (df[["col"]]) already returns a 2-D structure.
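The shapes involved can be checked directly. This sketch uses made-up sizes and shows both routes to a one-column X: reshape(-1, 1) on a NumPy array and double brackets on a DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up values for a single feature.
sizes = np.array([850, 1200, 1500, 2000])

X_col = sizes.reshape(-1, 1)  # -1: infer the rows, 1: one column
print(sizes.shape)  # (4,)   flat, not a valid feature matrix
print(X_col.shape)  # (4, 1) one column, valid as X

df = pd.DataFrame({"size_sqft": sizes})
print(df["size_sqft"].shape)    # (4,)   single brackets give a 1-D Series
print(df[["size_sqft"]].shape)  # (4, 1) double brackets give a 2-D DataFrame
```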

The error message you will eventually see

If you pass a 1-D array where scikit-learn wants X, you will get an error like "Expected 2D array, got 1D array instead... Reshape your data using array.reshape(-1, 1)." When you see it, the fix is exactly what it says: your feature matrix is flat and needs to be shaped into columns. Now you know why — X is fundamentally a 2-D table.
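The following sketch triggers the error on purpose and then applies the fix. The values are made up, LinearRegression stands in for any estimator, and the exact wording of the message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values: one feature, four examples.
sizes = np.array([850.0, 1200.0, 1500.0, 2000.0])
prices = np.array([210000.0, 285000.0, 320000.0, 390000.0])

message = None
try:
    LinearRegression().fit(sizes, prices)  # X is flat (1-D): rejected
except ValueError as err:
    message = str(err)
    print(message.splitlines()[0])

# The fix: shape the single feature into one column.
model = LinearRegression().fit(sizes.reshape(-1, 1), prices)
print(model.coef_.shape)  # (1,) one coefficient for the one feature
```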

Selecting features by name, not by position

One more practical habit. Because the target is defined by meaning, not location, you should select columns by name, never by assuming the target is the first or last column. Names are self-documenting and survive column reordering; positions silently break the moment someone rearranges the table. Here are the idioms you will use constantly.
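A sketch of the two name-based idioms, on a made-up three-row table (the column name bedrooms and all values are illustrative). The last lines show that rearranging columns leaves name-based selection untouched while a positional guess changes meaning.

```python
import pandas as pd

# Made-up table, for illustration only.
houses = pd.DataFrame({
    "size_sqft": [850, 1200, 1500],
    "bedrooms": [2, 3, 3],
    "age_years": [30, 15, 20],
    "neighborhood": ["downtown", "suburb", "rural"],
    "price": [210000, 285000, 320000],
})

y = houses["price"]                             # target, selected by name
X_all = houses.drop(columns=["price"])          # every column except the target
X_chosen = houses[["size_sqft", "bedrooms"]]    # an explicit list of features

# Someone rearranges the table: name-based selection is unaffected...
reordered = houses[["price", "neighborhood", "age_years", "bedrooms", "size_sqft"]]
print(reordered["price"].equals(y))  # True

# ...but "the last column" is no longer the target.
print(reordered.iloc[:, -1].name)  # size_sqft
```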

Both idioms — drop(columns=[...]) to remove the target, or df[[...]] to list the features explicitly — are common and correct. Use drop when "every column except the target" is what you want; use the explicit list when you want precise control over which features enter the model (useful once you start selecting features deliberately, as the feature engineering chapter explores).

A quick check

Question (select one)

In the feature matrix X, what do the rows and columns represent?

Rows are features; columns are samples

Rows and columns both represent features

Rows are samples (individual examples); columns are features (the measured attributes)

Rows are the target values; columns are the predictions

Types of features

Not all features are alike, and the kind of each feature shapes how you must prepare it before a model can use it. Three types cover most of what you will meet.

Numeric (quantitative) features are plain numbers where arithmetic and ordering both make sense: size_sqft, age_years, temperature, income. "Bigger" and "smaller" are meaningful, and the gap between 10 and 20 equals the gap between 20 and 30. Most algorithms consume numeric features directly, though many work better when the numbers are put on a comparable scale — which is what the preprocessing chapter is about.

Categorical (nominal) features are labels with no inherent order: neighborhood (downtown, suburb, rural), color, country, product category. There is no sense in which "suburb" is greater than "rural." Models need numbers, not strings, so categorical features must be encoded, that is, turned into numbers in a way that does not invent a false order. The chapter on encoding categorical features covers how (one-hot encoding and related techniques).

Ordinal features are categories that do have a meaningful order, but where the spacing between them is not necessarily equal: a size of small/medium/large, an education level, a survey rating of poor/fair/good/excellent. The order matters (large > medium > small) but you cannot assume "large minus medium" equals "medium minus small." Ordinals sit between numeric and categorical and are encoded with their order preserved.
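As one possible sketch of order-preserving encoding, pandas can hold an ordered categorical and expose integer codes that follow the order you declare. The values are made up, and this is only one of several ways to encode ordinals.

```python
import pandas as pd

# Made-up ordinal values, for illustration only.
sizes = pd.Series(["small", "large", "medium", "small"])

# Declare the order explicitly; otherwise pandas would sort alphabetically.
order = ["small", "medium", "large"]
as_ordered = pd.Categorical(sizes, categories=order, ordered=True)

# Integer codes follow the declared order: small=0, medium=1, large=2.
codes = as_ordered.codes.tolist()
print(codes)  # [0, 2, 1, 0]
```

The codes preserve rank only. The equal gaps between 0, 1, and 2 are an artifact of the encoding, not a claim about the real spacing between the categories.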

The classic encoding mistake

A tempting but wrong move is to encode a nominal category as 1, 2, 3 — mapping downtown to 1, suburb to 2, rural to 3. This secretly tells the model that rural (3) is "greater than" downtown (1) and that suburb is exactly halfway between, none of which is true. The model will dutifully act on that fiction. Nominal categories need an encoding (like one-hot) that introduces no false ordering. We devote a full page to doing this correctly.
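For contrast, here is a minimal sketch of one-hot encoding with pandas get_dummies, on made-up values. Each category becomes its own 0/1 column, so no category is numerically "greater" than another.

```python
import pandas as pd

# Made-up nominal values, for illustration only.
df = pd.DataFrame({"neighborhood": ["downtown", "suburb", "rural", "suburb"]})

# One 0/1 column per category; dtype=int gives 0/1 rather than True/False.
one_hot = pd.get_dummies(df, columns=["neighborhood"], dtype=int)
print(one_hot)

# Exactly one column is "on" in every row.
print(one_hot.sum(axis=1).tolist())  # [1, 1, 1, 1]
```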

Types of targets — and the task they imply

Here is the payoff for all this care: the type of your target y determines what kind of machine learning problem you are solving. This is one of the most useful diagnostics in the whole field, and it takes only a glance at y.

Continuous target (a number on a scale — price, temperature, sales) → you are doing regression. The model predicts a quantity.
Categorical target (a class label — spam/not-spam, the species of a flower, which of five products) → you are doing classification. The model predicts a category.
No target at all (you have X but no y) → you are doing unsupervised learning. With no answers to predict, the model instead looks for structure in X itself, such as natural groupings.

So before choosing any algorithm, look at y. A column of dollar amounts points you at regression; a column of labels points you at classification; the absence of a y points you at unsupervised methods. The next two pages (supervised vs. unsupervised, and regression, classification, and clustering) build directly on this distinction, so make sure it feels solid.
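The "look at y" habit can be sketched as a rough heuristic. The function guess_task and its max_classes threshold are hypothetical, chosen only for illustration; a real decision still needs human judgment about what the target means.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def guess_task(y=None, max_classes=10):
    """Hypothetical rule of thumb: suggest a task type from the target."""
    if y is None:
        return "unsupervised"      # no target at all
    y = pd.Series(y)
    if not is_numeric_dtype(y):
        return "classification"    # text labels
    if y.nunique() <= max_classes:
        return "classification"    # a few distinct numeric codes
    return "regression"            # many distinct numeric values

# Made-up targets, for illustration only.
prices = [210000.0 + 15000.0 * i for i in range(12)]
labels = ["spam", "not-spam", "spam", "not-spam"]

print(guess_task(prices))   # regression
print(guess_task(labels))   # classification
print(guess_task())         # unsupervised
```

The threshold is arbitrary: a numeric target with only a handful of distinct values, such as a 1-to-5 rating, could reasonably be framed either way.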

The boundary can be a judgment call

Sometimes the target's type is a modeling choice, not a fact. A 1-to-5 star rating could be treated as a number (regression) or as five ordered categories (classification). House prices bucketed into "low / medium / high" turn a regression into a classification. Part of framing a problem is deciding which view serves your actual goal — and there is often no single right answer, only tradeoffs you will learn to weigh.
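Bucketing a continuous target can be sketched with pandas cut. The prices and the cut points (300,000 and 450,000) are arbitrary values chosen for illustration.

```python
import pandas as pd

# Made-up prices, for illustration only.
prices = pd.Series([210000, 285000, 320000, 390000, 520000, 610000])

# Arbitrary cut points turn a continuous target into three ordered classes.
bands = pd.cut(
    prices,
    bins=[0, 300000, 450000, float("inf")],
    labels=["low", "medium", "high"],
)

print(bands.tolist())  # ['low', 'low', 'medium', 'medium', 'high', 'high']
```

With prices as y the task is regression; with bands as y the same table becomes a classification problem, at the cost of discarding the differences within each band.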

Common misconceptions

"More features always means a better model." Not so. Irrelevant or redundant features add noise and can hurt performance — a problem the feature engineering page tackles head-on. Quality and relevance beat raw count.
"The target has to be the last column." Position is irrelevant; meaning is everything. The target is whatever you want to predict, wherever it sits in the table. You select it by name, not by location.
"X must be a NumPy array." A pandas DataFrame works just as well, and is often better because the column names survive — which makes encoding and interpretation far easier. scikit-learn accepts both.
"Categorical features can be dropped into a model as strings." Almost all scikit-learn estimators require numeric input. Categories must be encoded into numbers first; handing a model raw strings will raise an error.

Real-world applications

Framing data as X and y is the universal first step of every supervised project, and the types involved tell you immediately what you are dealing with:

A bank predicting loan default frames each applicant as a row of features (income, history, amount) with a categorical target (default / no default) → classification.
A retailer forecasting next month's sales builds features from season, promotions, and history with a continuous target (units sold) → regression.
A hospital estimating length of stay uses patient features with a continuous target (days) → regression — or buckets it into short / medium / long → classification, depending on how the result will be used.
A streaming service with no labels at all groups viewers by behavior using only X → unsupervised clustering.

In each case the practitioner's first move is identical: identify the features, identify (or note the absence of) the target, check that the rows line up. Everything downstream depends on getting this right.

Your turn

The challenge gives you a small table of patients and asks you to perform the fundamental split: build X from the feature columns and y from the target column, correctly aligned. This is the single most common operation in all of supervised machine learning, and the rest of the course assumes you can do it without thinking.

A small DataFrame patients is provided. Each row is one patient. We want to predict whether each patient has the condition — the has_condition column (1 = yes, 0 = no) — from the other columns.

Build the target y from the has_condition column of patients.
Build the feature matrix X from all the other columns (everything except has_condition). Hint: patients.drop(columns=[...]).
Store the number of feature columns in n_features (use X.shape).

The hidden tests check that y is the right column, that X excludes the target, that the rows of X and y stay aligned, and that n_features is correct.

Check your understanding

Question (select one)

You have a table where each row is a customer and one column, churned, marks whether they left (1) or stayed (0). You want to predict churn from the other columns. How should X and y be built?

X is the churned column; y is everything else

X and y both include the churned column

y is the churned column; X is all the other columns, with churned removed

X is the first column only; y is the last column only

Question (select one)

neighborhood takes the values "downtown", "suburb", and "rural". Why is encoding it as downtown=1, suburb=2, rural=3 a mistake?

Because scikit-learn cannot store integers

Because the numbers are too large for the model

Because it invents a false order and spacing — implying rural (3) is "greater than" downtown (1) and that suburb is exactly halfway — none of which is true for an unordered category

Because categorical features should be deleted, not encoded

Question (select one)

A model's target y is a column of house prices in dollars (continuous numbers). What type of machine learning task is this?

Classification, because houses come in categories

Clustering, because we are grouping houses

Regression, because the target is a continuous numeric quantity

Unsupervised learning, because prices vary

Question (select one)

What does X.shape of (150, 4) tell you?

150 features describing 4 samples

150 target values and 4 predictions

150 samples (rows), each described by 4 features (columns)

A 150-by-4 image

Question (select one)

Which statement about features is correct?

Adding more feature columns always improves the model

The target must always be the final column of the table

Irrelevant or redundant features can add noise and hurt performance, so relevance matters more than sheer count

A pandas DataFrame cannot be used as X; only NumPy arrays work
