Features and Targets
Every supervised machine learning problem reduces to a feature matrix X and a target y. We define both precisely, walk through the types of features and targets, and practice splitting a real table into the two pieces every model expects.
You have now seen the same four-line shape several times: load data into X
and y, split, fit, evaluate. It is time to slow down and ask what X and
y actually are, because almost every modeling decision — what algorithm
to use, what task you are even solving — flows directly from how your data
is shaped into these two objects.
The good news is that the structure is simple and universal. Once you can look at any table and confidently say "these columns are the features, this column is the target," you can frame essentially any supervised problem. That framing is the skill this page builds.
The two pieces every model needs
A supervised machine learning model learns to map inputs to outputs. We give those two roles standard names and standard variable conventions that you will see in every scikit-learn example, every textbook, and every page of this course:
- X: the feature matrix. A 2-D table of inputs. Each row is one example (also called a sample or observation); each column is one feature (also called a variable, attribute, or predictor). X is capitalized because it is a matrix (two dimensions).
- y: the target. A 1-D list of answers, one per row of X. It is the thing you want the model to predict. y is lowercase because it is a vector (one dimension).
The single most important alignment rule: row i of X and entry i of
y describe the same example. The features in the first row of X belong
to the answer in the first slot of y. If that correspondence ever breaks,
your model learns nonsense — so we always split X and y together, never
separately.
Read the diagram across, not down: each row of X lines up with one entry of
y. Columns of X are features; the single column y is what we predict.
Hold this picture in mind and the abstract letters become concrete.
Why these exact letters?
The convention comes from the math: a model is a function y = f(X), just
like the y = f(x) you saw in algebra. X is capital because it is a whole
matrix of inputs (many rows, many columns); y is lowercase because it is a
single column of outputs. You do not have to use these names — Python does
not care — but everyone does, so following the convention makes your code
instantly readable to others.
A concrete example: from a table to X and y
Abstractions are slippery, so let us anchor them in a real little table. Imagine a handful of houses we want to price. Each house has a size, a number of bedrooms, an age, and a neighborhood; the thing we want to predict is its price.
We will build this table as a small pandas DataFrame and look at it.
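A minimal sketch of such a table. The name houses and the columns size_sqft, age_years, neighborhood, and price follow the text on this page; the bedrooms column name and every value are invented purely for illustration.

```python
import pandas as pd

# Six illustrative houses. All values are made up for this sketch.
houses = pd.DataFrame({
    "size_sqft": [850, 1200, 1500, 2100, 950, 1800],
    "bedrooms": [2, 3, 3, 4, 2, 4],
    "age_years": [30, 12, 8, 3, 45, 15],
    "neighborhood": ["downtown", "suburb", "suburb", "rural", "downtown", "rural"],
    "price": [210_000, 265_000, 310_000, 295_000, 198_000, 280_000],
})

print(houses)
print(houses.shape)  # (6, 5): six rows, five columns
```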
Six houses, five columns. Now comes the central act of framing a supervised
problem: deciding which columns are inputs (features) and which column is
the output (target). Here we want to predict price from everything else,
so price becomes y and the other four columns become X.
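In code, that framing decision takes two lines. This is a sketch rather than the course's exact cell: the table is rebuilt here so the example runs on its own, and its values and the bedrooms column name are invented for illustration.

```python
import pandas as pd

# Illustrative stand-in for the six-house table; all values are made up.
houses = pd.DataFrame({
    "size_sqft": [850, 1200, 1500, 2100, 950, 1800],
    "bedrooms": [2, 3, 3, 4, 2, 4],
    "age_years": [30, 12, 8, 3, 45, 15],
    "neighborhood": ["downtown", "suburb", "suburb", "rural", "downtown", "rural"],
    "price": [210_000, 265_000, 310_000, 295_000, 198_000, 280_000],
})

X = houses.drop(columns=["price"])  # every column except the target
y = houses["price"]                 # the single answer column

print(X.shape)  # (6, 4)
print(y.shape)  # (6,)
```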
That is the entire idea. houses.drop(columns=["price"]) keeps every input
column; houses["price"] pulls out the one answer column. The shapes confirm
the structure: X is (6, 4) — six examples, four features each — and y is
(6,) — six answers. Notice X and y still have the same number of
rows, six, which is what keeps row i aligned with answer i.
The most common framing bug
Never leave the target inside X. If price were still a column of X, the
model could "predict" price by simply copying it — perfect training accuracy,
zero real skill. Worse, you could not use the model at all, because at
prediction time you would not have the price (predicting it is the whole
point). Always remove the target from the feature matrix. This is one
flavor of data leakage, a trap the preprocessing chapters return to.
Reading X.shape
X.shape is the first thing experienced practitioners check after building a
feature matrix, because it answers two questions at once: how many examples
do I have (rows) and how many features describe each one (columns).
Whether X is a pandas DataFrame or a plain NumPy array, scikit-learn reads
it the same way: rows are samples, columns are features. The shape is always
(n_samples, n_features). Internalize that ordering — rows first, features
second — and you will rarely be confused about what an array of data
represents.
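A small sketch of reading the shape from both container types. The column names and values are illustrative assumptions.

```python
import pandas as pd

# A tiny illustrative feature matrix: 3 samples, 2 features.
X_df = pd.DataFrame({"size_sqft": [850, 1200, 1500], "age_years": [30, 12, 8]})
X_np = X_df.to_numpy()  # the same data as a plain NumPy array

# Unpack in the (n_samples, n_features) order.
n_samples, n_features = X_df.shape
print(n_samples, n_features)  # 3 2
print(X_np.shape)             # (3, 2)
```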
A sanity check you should run every time
After building X and y, confirm X.shape[0] == len(y). If the number of
rows in X does not equal the number of answers in y, something is
misaligned and scikit-learn will (rightly) refuse to fit. Making this check a
reflex catches a whole category of bugs before they start.
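One way to make that check a reflex is a tiny helper. The function name check_aligned is hypothetical (it is not part of pandas or scikit-learn), and the data are invented for illustration.

```python
import pandas as pd

X = pd.DataFrame({"size_sqft": [850, 1200, 1500], "age_years": [30, 12, 8]})
y = pd.Series([210_000, 265_000, 310_000], name="price")

def check_aligned(X, y):
    """Hypothetical helper: raise if X and y have different numbers of rows."""
    if X.shape[0] != len(y):
        raise ValueError(f"X has {X.shape[0]} rows but y has {len(y)} entries")

check_aligned(X, y)  # passes silently: 3 rows, 3 answers

try:
    check_aligned(X, y.iloc[:-1])  # deliberately drop one answer
except ValueError as err:
    print(err)  # X has 3 rows but y has 2 entries
```

A matching row count is necessary but not sufficient: it does not prove that row i still pairs with answer i, which is why X and y should be derived from the same table in the same step.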
X is always 2-D, even with a single feature
A subtle but extremely common stumbling block: scikit-learn expects X to be
two-dimensional — a matrix — even when you have only one feature. A flat
1-D list of values is not a valid feature matrix; it must be shaped as "one
column," which is a 2-D array with a single column. The target y, by
contrast, is 1-D. Mixing these up produces one of the most frequent beginner
errors.
The rule of thumb is simple: X has two dimensions (samples by features),
y has one (samples). When you have a single feature, reshape(-1, 1)
turns a flat array into the required column shape — the -1 means "infer the
number of rows," and the 1 means "one column." A DataFrame avoids this trap
naturally, because selecting columns with double brackets (df[["col"]])
already returns a 2-D structure.
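The shapes involved can be seen directly. A short sketch with invented values:

```python
import numpy as np
import pandas as pd

sizes = np.array([850, 1200, 1500, 2100])
print(sizes.shape)  # (4,)  flat, 1-D: not a valid feature matrix

X = sizes.reshape(-1, 1)  # -1: infer the row count, 1: one column
print(X.shape)  # (4, 1)  2-D with a single feature column

df = pd.DataFrame({"size_sqft": sizes})
print(df["size_sqft"].shape)    # (4,)   single brackets give a 1-D Series
print(df[["size_sqft"]].shape)  # (4, 1) double brackets give a 2-D DataFrame
```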
The error message you will eventually see
If you pass a 1-D array where scikit-learn wants X, you will get an error
like "Expected 2D array, got 1D array instead... Reshape your data using
array.reshape(-1, 1)." When you see it, the fix is exactly what it says:
your feature matrix is flat and needs to be shaped into columns. Now you know
why — X is fundamentally a 2-D table.
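The error and its fix can be reproduced in a few lines. This sketch uses LinearRegression only as a convenient estimator; the numbers are invented, and the exact wording of the message may vary between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([850.0, 1200.0, 1500.0, 2100.0])  # flat, 1-D
prices = np.array([210_000.0, 265_000.0, 310_000.0, 295_000.0])

model = LinearRegression()
message = ""
try:
    model.fit(sizes, prices)  # wrong: X is 1-D
except ValueError as err:
    message = str(err)
    print(message.splitlines()[0])  # first line of the error text

# The fix: shape the single feature into one column.
model.fit(sizes.reshape(-1, 1), prices)
print(model.coef_.shape)  # (1,) one coefficient for one feature
```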
Selecting features by name, not by position
One more practical habit. Because the target is defined by meaning, not location, you should select columns by name, never by assuming the target is the first or last column. Names are self-documenting and survive column reordering; positions silently break the moment someone rearranges the table. Here are the idioms you will use constantly.
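A sketch of the two name-based idioms, using a small table with invented values. It also shows that selecting by name is unaffected when the columns are rearranged.

```python
import pandas as pd

# Illustrative table; the values are made up.
houses = pd.DataFrame({
    "size_sqft": [850, 1200, 1500],
    "age_years": [30, 12, 8],
    "neighborhood": ["downtown", "suburb", "rural"],
    "price": [210_000, 265_000, 310_000],
})

target = "price"
y = houses[target]

# Idiom 1: every column except the target.
X_all = houses.drop(columns=[target])

# Idiom 2: an explicit list of feature names.
feature_names = ["size_sqft", "age_years"]
X_some = houses[feature_names]

# Selecting by name still works after someone reorders the columns.
reordered = houses[["price", "neighborhood", "age_years", "size_sqft"]]
print(reordered[target].equals(y))  # True
```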
Both idioms — drop(columns=[...]) to remove the target, or df[[...]] to
list the features explicitly — are common and correct. Use drop when "every
column except the target" is what you want; use the explicit list when you
want precise control over which features enter the model (useful once you
start selecting features deliberately, as the feature engineering chapter
explores).
A quick check
In the feature matrix X, what do the rows and columns represent?
Rows are features; columns are samples
Rows and columns both represent features
Rows are samples (individual examples); columns are features (the measured attributes)
Rows are the target values; columns are the predictions
Types of features
Not all features are alike, and the kind of each feature shapes how you must prepare it before a model can use it. Three types cover most of what you will meet.
Numeric (quantitative) features are plain numbers where arithmetic and
ordering both make sense: size_sqft, age_years, temperature, income.
"Bigger" and "smaller" are meaningful, and the gap between 10 and 20 equals
the gap between 20 and 30. Most algorithms consume numeric features directly,
though many work better when the numbers are put on a comparable scale —
which is what the preprocessing chapter is about.
Categorical (nominal) features are labels with no inherent order:
neighborhood (downtown, suburb, rural), color, country, product category.
There is no sense in which "suburb" is greater than "rural." Models need
numbers, not strings, so categorical features must be encoded — turned
into numbers in a way that does not invent a false order. The
encoding categorical features chapter covers how (one-hot encoding and
friends).
Ordinal features are categories that do have a meaningful order, but
where the spacing between them is not necessarily equal: a size of
small/medium/large, an education level, a survey rating of
poor/fair/good/excellent. The order matters (large > medium > small) but you
cannot assume "large minus medium" equals "medium minus small." Ordinals sit
between numeric and categorical and are encoded with their order preserved.
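One possible way to encode an ordinal feature with its order preserved is an ordered pandas Categorical. This is a sketch with invented values, not necessarily the approach the encoding chapter uses; scikit-learn's OrdinalEncoder is an alternative.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])

# State the order explicitly instead of relying on alphabetical order.
order = ["small", "medium", "large"]
sizes_ordered = pd.Categorical(sizes, categories=order, ordered=True)

# Integer codes follow the declared order: small=0, medium=1, large=2.
codes = sizes_ordered.codes.tolist()
print(codes)  # [0, 2, 1, 0]
```

The codes respect the ranking, but as plain integers they still look evenly spaced to a model, so whether that is acceptable depends on the feature and the algorithm.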
The classic encoding mistake
A tempting but wrong move is to encode a nominal category as 1, 2, 3 — mapping downtown to 1, suburb to 2, rural to 3. This secretly tells the model that rural (3) is "greater than" downtown (1) and that suburb is exactly halfway between, none of which is true. The model will dutifully act on that fiction. Nominal categories need an encoding (like one-hot) that introduces no false ordering. We devote a full page to doing this correctly.
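For contrast, here is a minimal sketch of one-hot encoding with pandas' get_dummies, which gives each category its own 0/1 column and therefore implies no order. The values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"neighborhood": ["downtown", "suburb", "rural", "suburb"]})

# One 0/1 column per category; dtype=int keeps the output as integers.
encoded = pd.get_dummies(df, columns=["neighborhood"], dtype=int)

print(encoded.columns.tolist())
# ['neighborhood_downtown', 'neighborhood_rural', 'neighborhood_suburb']
print(encoded.sum(axis=1).tolist())  # [1, 1, 1, 1] exactly one flag set per row
```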
Types of targets — and the task they imply
Here is the payoff for all this care: the type of your target y
determines what kind of machine learning problem you are solving. This is
one of the most useful diagnostics in the whole field, and it takes only a
glance at y.
- Continuous target (a number on a scale — price, temperature, sales) → you are doing regression. The model predicts a quantity.
- Categorical target (a class label — spam/not-spam, the species of a flower, which of five products) → you are doing classification. The model predicts a category.
- No target at all (you have X but no y) → you are doing unsupervised learning. With no answers to predict, the model instead looks for structure in X itself, such as natural groupings.
So before choosing any algorithm, look at y. A column of dollar amounts
points you at regression; a column of labels points you at classification;
the absence of a y points you at unsupervised methods. The very next two
pages — supervised vs. unsupervised and regression, classification, and
clustering — are built entirely on this distinction, so make sure it feels
solid.
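The "glance at y" can be mimicked with a deliberately crude heuristic. The function guess_task is hypothetical and only inspects the dtype; it is a sketch of the reasoning, not a rule to rely on, and integer targets are reported as ambiguous because they may be counts, ratings, or class codes.

```python
import pandas as pd

def guess_task(y=None):
    """Hypothetical, crude heuristic: map a target's dtype to a likely task."""
    if y is None:
        return "unsupervised"  # no target at all
    y = pd.Series(y)
    if pd.api.types.is_bool_dtype(y):
        return "classification"
    if pd.api.types.is_float_dtype(y):
        return "regression"  # continuous quantity
    if pd.api.types.is_integer_dtype(y):
        return "ambiguous"  # could be a count or a class code
    return "classification"  # strings and other labels

print(guess_task([249_999.5, 310_500.0, 198_250.75]))  # regression
print(guess_task(["spam", "not-spam", "spam"]))         # classification
print(guess_task([0, 1, 1, 0]))                         # ambiguous
print(guess_task())                                     # unsupervised
```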
The boundary can be a judgment call
Sometimes the target's type is a modeling choice, not a fact. A 1-to-5 star rating could be treated as a number (regression) or as five ordered categories (classification). House prices bucketed into "low / medium / high" turn a regression into a classification. Part of framing a problem is deciding which view serves your actual goal — and there is often no single right answer, only tradeoffs you will learn to weigh.
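Bucketing a continuous target is a one-liner with pandas' cut. In this sketch the prices and the two thresholds are arbitrary assumptions chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

prices = pd.Series([150_000, 250_000, 450_000, 320_000])

# Arbitrary illustrative thresholds; each bin includes its right edge.
bins = [0, 200_000, 400_000, np.inf]
labels = ["low", "medium", "high"]

price_band = pd.cut(prices, bins=bins, labels=labels)
print(price_band.tolist())  # ['low', 'medium', 'high', 'medium']
```

Using prices as y frames a regression problem; using price_band as y frames a classification problem on the same rows.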
Common misconceptions
- "More features always means a better model." Not so. Irrelevant or redundant features add noise and can hurt performance — a problem the feature engineering page tackles head-on. Quality and relevance beat raw count.
- "The target has to be the last column." Position is irrelevant; meaning is everything. The target is whatever you want to predict, wherever it sits in the table. You select it by name, not by location.
- "X must be a NumPy array." A pandas DataFrame works just as well, and is often better because the column names survive, which makes encoding and interpretation far easier. scikit-learn accepts both.
- "Categorical features can be dropped into a model as strings." Almost all scikit-learn estimators require numeric input. Categories must be encoded into numbers first; handing a model raw strings will raise an error.
Real-world applications
Framing data as X and y is the universal first step of every supervised
project, and the types involved tell you immediately what you are dealing
with:
- A bank predicting loan default frames each applicant as a row of features (income, history, amount) with a categorical target (default / no default) → classification.
- A retailer forecasting next month's sales builds features from season, promotions, and history with a continuous target (units sold) → regression.
- A hospital estimating length of stay uses patient features with a continuous target (days) → regression — or buckets it into short / medium / long → classification, depending on how the result will be used.
- A streaming service with no labels at all groups viewers by behavior using only X → unsupervised clustering.
In each case the practitioner's first move is identical: identify the features, identify (or note the absence of) the target, check that the rows line up. Everything downstream depends on getting this right.
Your turn
The challenge gives you a small table of patients and asks you to perform the
fundamental split: build X from the feature columns and y from the
target column, correctly aligned. This is the single most common operation in
all of supervised machine learning, and the rest of the course assumes you
can do it without thinking.
A small DataFrame patients is provided. Each row is one
patient. We want to predict whether each patient has the condition — the
has_condition column (1 = yes, 0 = no) — from the other columns.
- Build the target y from the has_condition column of patients.
- Build the feature matrix X from all the other columns (everything except has_condition). Hint: patients.drop(columns=[...]).
- Store the number of feature columns in n_features (use X.shape).
The hidden tests check that y is the right column, that X excludes the
target, that the rows of X and y stay aligned, and that n_features
is correct.
Check your understanding
You have a table where each row is a customer and one column, churned, marks whether they left (1) or stayed (0). You want to predict churn from the other columns. How should X and y be built?
X is the churned column; y is everything else
X and y both include the churned column
y is the churned column; X is all the other columns, with churned removed
X is the first column only; y is the last column only
neighborhood takes the values "downtown", "suburb", and "rural". Why is encoding it as downtown=1, suburb=2, rural=3 a mistake?
Because scikit-learn cannot store integers
Because the numbers are too large for the model
Because it invents a false order and spacing — implying rural (3) is "greater than" downtown (1) and that suburb is exactly halfway — none of which is true for an unordered category
Because categorical features should be deleted, not encoded
A model's target y is a column of house prices in dollars (continuous numbers). What type of machine learning task is this?
Classification, because houses come in categories
Clustering, because we are grouping houses
Regression, because the target is a continuous numeric quantity
Unsupervised learning, because prices vary
What does X.shape of (150, 4) tell you?
150 features describing 4 samples
150 target values and 4 predictions
150 samples (rows), each described by 4 features (columns)
A 150-by-4 image
Which statement about features is correct?
Adding more feature columns always improves the model
The target must always be the final column of the table
Irrelevant or redundant features can add noise and hurt performance, so relevance matters more than sheer count
A pandas DataFrame cannot be used as X; only NumPy arrays work