Types of Data
Why a variable's type decides which summaries, charts, and tests are valid — categorical vs numerical, the four measurement scales, and the encoding traps that make people average things that can't be averaged.
Before you compute a single statistic, you have to answer a question that sounds trivial but quietly governs everything: what kind of data is this? The type of a variable decides which summaries make sense, which charts are honest, and which tests are even legal. Take the mean of the wrong kind of column and you'll get a number — a clean, confident, completely meaningless number.
The classic example: a survey stores favorite-flavor responses as 1 = vanilla, 2 = chocolate, 3 = strawberry. Average them and you might get
1.8. There is no flavor 1.8. The arithmetic ran fine; the meaning
is nonsense, because those numbers are labels, not quantities. This
page is about telling those apart — by eye and in pandas — so you never
ship a 1.8-flavored conclusion.
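The flavor example can be reproduced in a few lines. This is a minimal sketch using only the standard library; the ten responses are made up for illustration and were chosen so the mean comes out at 1.8.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey: each response is a flavor code, not a quantity.
FLAVORS = {1: "vanilla", 2: "chocolate", 3: "strawberry"}
responses = [2, 2, 2, 1, 1, 3, 2, 1, 2, 2]

# The arithmetic runs without complaint, but the result names no flavor.
mean_code = mean(responses)

# Counting the labels is the summary that actually means something.
counts = Counter(FLAVORS[code] for code in responses)
top_flavor, top_count = counts.most_common(1)[0]

print(f"mean of codes: {mean_code}")
print(f"most common flavor: {top_flavor} ({top_count} of {len(responses)})")
```

The mean is a valid number but not a valid flavor, while the count table answers the question the survey was asking.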
Why this page comes before the statistics
Every later technique assumes you've classified your variables correctly. Measures of Center asks "mean or mode?" — the answer is a type question. Visualizing Distributions asks "histogram or bar chart?" — also a type question. Pick the right type once and a dozen downstream choices become obvious.
The big split: categorical vs. numerical
At the top level, every variable is either categorical (it labels which group something belongs to) or numerical (it measures how much or how many). Each splits once more:
- Nominal — pure labels with no inherent order: country, payment method, color, user_id. You can count how many fall in each category, but "greater than" is meaningless.
- Ordinal — categories with a meaningful order but unknown, uneven gaps: shirt sizes S < M < L, a 1–5 satisfaction rating, education level. You can rank them, but the distance between "satisfied" and "very satisfied" isn't guaranteed equal to the distance between "neutral" and "satisfied."
- Discrete numerical — counts you get by tallying whole things: number of logins, defects per batch, children per household. Usually non-negative integers.
- Continuous numerical — measurements that can take any value in a range (limited only by instrument precision): height, temperature, price, response time.
A fast field test
Ask two questions. (1) Can you order the values? No → nominal. (2) If ordered, are the gaps equal and is arithmetic meaningful? No → ordinal. Yes → numerical (then ask if it's a count → discrete, or a measurement → continuous). Most classification mistakes come from skipping question 2 and treating ordered labels as real numbers.
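The two questions can be written down as a tiny decision function. This is only a sketch: the function name `classify` and its boolean parameters are hypothetical, and the answers to the questions still have to come from your own judgment about what the column means.

```python
def classify(can_order: bool, equal_gaps: bool = False, is_count: bool = False) -> str:
    """Apply the two-question field test and return a statistical type."""
    if not can_order:
        return "nominal"   # question 1: no meaningful order
    if not equal_gaps:
        return "ordinal"   # question 2: ordered, but arithmetic is not meaningful
    # Ordered with equal gaps: numerical, so split counts from measurements.
    return "discrete" if is_count else "continuous"


examples = {
    "country": classify(can_order=False),
    "shirt_size": classify(can_order=True, equal_gaps=False),
    "num_logins": classify(can_order=True, equal_gaps=True, is_count=True),
    "response_time": classify(can_order=True, equal_gaps=True),
}

for name, kind in examples.items():
    print(f"{name}: {kind}")
```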
The four measurement scales (the intuition)
A slightly finer lens, due to S.S. Stevens, splits data into four measurement scales. You don't need the theory — you need the practical rule each scale implies: which operations are meaningful.
| Scale | What it adds | Example | Mean OK? | "True zero"? |
|---|---|---|---|---|
| Nominal | Names only | eye color, country | No | No |
| Ordinal | Order | S/M/L, Likert 1–5 | No (use median/mode) | No |
| Interval | Equal gaps | temperature in °C, calendar year | Yes (differences) | No |
| Ratio | True zero | height, price, count, duration | Yes (and ratios) | Yes |
The ladder is cumulative: each scale keeps the powers of the ones above it and adds one. The two boundaries that trip people up:
- Ordinal → interval. Ordinal tells you the order but not the spacing. Averaging it pretends the gaps are equal when they aren't.
- Interval → ratio. Interval has equal gaps but no true zero, so ratios are meaningless. 20 °C is not "twice as hot" as 10 °C (the zero is arbitrary). Ratio scales have a real zero, so "twice as long" or "half the price" are valid.
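One way to see the interval/ratio boundary is to change units and check whether a ratio survives. The sketch below uses the standard Celsius-to-Kelvin and Celsius-to-Fahrenheit conversions; the length example and the variable names are illustrative assumptions.

```python
def c_to_k(celsius: float) -> float:
    """Convert Celsius to Kelvin (shifts the zero point)."""
    return celsius + 273.15


def c_to_f(celsius: float) -> float:
    """Convert Celsius to Fahrenheit (rescales and shifts the zero point)."""
    return celsius * 9 / 5 + 32


warm_c, cool_c = 20.0, 10.0

# Interval scale: the "ratio" depends entirely on where the unit puts zero.
ratio_c = warm_c / cool_c
ratio_k = c_to_k(warm_c) / c_to_k(cool_c)
ratio_f = c_to_f(warm_c) / c_to_f(cool_c)
print(f"ratio in C: {ratio_c:.3f}, in K: {ratio_k:.3f}, in F: {ratio_f:.3f}")

# Ratio scale: length has a true zero, so the ratio survives a unit change.
M_TO_FT = 3.28084
long_m, short_m = 2.0, 1.0
ratio_m = long_m / short_m
ratio_ft = (long_m * M_TO_FT) / (short_m * M_TO_FT)
print(f"ratio in m: {ratio_m:.3f}, in ft: {ratio_ft:.3f}")
```

The temperature ratio changes with every unit, so "twice as hot" describes the unit rather than the weather; the length ratio stays at 2 in both units.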
Common misconception: treating ordinal as interval
The single most common scale error in data work is averaging an ordinal variable. A 1–5 star rating looks numeric, so people report "average rating 4.2." But the gap from 1 to 2 may not equal the gap from 4 to 5, so the mean is built on an assumption you can't justify. The honest summaries are the median, the mode, and the full distribution of counts (how many 1s, 2s, …, 5s). Means of Likert items are common in practice and sometimes defensible, but treat them as a convenient approximation, not a rigorous summary — and never for nominal codes.
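A small illustration of why the mean alone can mislead on star ratings: the two products below are made-up examples that share the same mean but have different medians and very different count tables.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical 1-5 star ratings for two products.
ratings = {
    "steady": [4, 4, 4, 4, 4],
    "mixed": [1, 4, 5, 5, 5],
}

summaries = {}
for name, stars in ratings.items():
    tally = Counter(stars)
    summaries[name] = {
        "mean": mean(stars),
        "median": median(stars),
        # Full distribution: how many 1s, 2s, ..., 5s.
        "counts": {star: tally.get(star, 0) for star in range(1, 6)},
    }
    print(name, summaries[name])
```

Both products report a mean of 4, yet only the median and the counts reveal that one of them has a one-star review next to mostly five-star ones.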
Choosing summaries by type
The payoff of all this taxonomy is a simple lookup: once you know the type, the right summary and chart almost pick themselves.
- Categorical (nominal or ordinal): count categories with value_counts(); report the mode (most frequent) and proportions; visualize with a bar chart. For ordinal, the median category is also meaningful.
- Numerical: report center (mean if roughly symmetric, median if skewed) and spread (standard deviation or IQR); visualize with a histogram or box plot.

We dig into these choices in Measures of Center and Visualizing Distributions.
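The lookup above might look like this in pandas. The DataFrame is hypothetical (column names and values are invented for illustration), and the sketch assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data; names and values are made up for illustration.
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105, 106],
    "country": ["US", "DE", "US", "IN", "US", "DE"],
    "logins": [3, 0, 7, 2, 3, 1],
    "response_ms": [120.5, 98.0, 310.2, 101.7, 115.3, 99.9],
})

# dtypes describe storage only; they say nothing about meaning.
print(df.dtypes)

# Categorical column: counts, proportions, and the mode.
country_counts = df["country"].value_counts()
country_share = df["country"].value_counts(normalize=True)
country_mode = df["country"].mode()[0]

# Numerical column: center and spread (the median and IQR resist the outlier).
response_mean = df["response_ms"].mean()
response_median = df["response_ms"].median()
q1, q3 = df["response_ms"].quantile([0.25, 0.75])
response_iqr = q3 - q1

print(country_counts)
print(f"mode: {country_mode}")
print(f"mean: {response_mean:.1f}, median: {response_median:.1f}, IQR: {response_iqr:.1f}")
```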
Notice that user_id is stored as an integer, so pandas reports its
dtype as int64 — but it is nominal: the IDs are names that happen
to look like numbers. The next section is about exactly that trap.
Encoding traps: when numbers aren't quantities
The most dangerous columns are the ones that look numerical but
aren't. pandas will happily compute df["zip_code"].mean() and hand
you a number. The dtype is int64; the meaning is gibberish.
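A minimal sketch of the trap, assuming pandas is installed. The four rows are made up, and the zip codes are arbitrary labels picked so the mean lands near the figure quoted in the text; they are not the original data.

```python
import pandas as pd

# Hypothetical rows: a label column, a rank column, and a true quantity.
df = pd.DataFrame({
    "zip_code": [10001, 90210, 60601, 34008],   # nominal codes stored as integers
    "rating": [5, 4, 2, 5],                      # ordinal 1-5 stars
    "price_usd": [19.99, 24.50, 9.75, 14.00],    # ratio-scale quantity
})

print(df.dtypes)

# pandas averages every numeric-looking column without asking what it means.
means = df.mean()
print(means)
```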
The averaged zip code (~48,705) isn't a place; it's the centroid of
four arbitrary labels. The rating mean is less wrong but still rests
on assuming equal spacing. Only price_usd — a true ratio quantity —
earns its mean.
The numeric-looking ID trap
If a column's numbers are identifiers or codes — user_id,
zip_code, store_number, product_sku, encoded categories like
1=vanilla — then arithmetic on them is meaningless no matter what
the dtype says. A reliable tell: would value + value or value / 2
mean anything? For a zip code or a flavor code, no. Cast these to
category (or str) so you don't accidentally average a label. The
computer can't tell a quantity from a code — that judgment is yours.
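Casting to category turns the silent mistake into a loud one. The sketch below assumes pandas is installed and uses made-up zip codes; in the pandas versions I am aware of, a mean on a categorical column raises a TypeError, though the exact message varies by version.

```python
import pandas as pd

# Hypothetical identifier column stored as integers.
df = pd.DataFrame({"zip_code": [10001, 90210, 60601, 10001]})

# Declare the intent: these are labels, not quantities.
df["zip_code"] = df["zip_code"].astype("category")

try:
    df["zip_code"].mean()
    mean_blocked = False
except TypeError as err:
    # Arithmetic reductions are refused on categorical data.
    mean_blocked = True
    print("mean refused:", err)

# Label-appropriate summaries still work.
zip_counts = df["zip_code"].value_counts()
zip_mode = df["zip_code"].mode()[0]
print(zip_counts)
print("mode:", zip_mode)
```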
A postal_code column is stored as integers. A teammate computes its mean to "find the typical region." What's the problem?
The mean is fine; it gives the geographic center
Postal codes should be floats, not integers
Postal code is nominal (a label that happens to look numeric), so averaging it is meaningless; count or mode is appropriate
The mean is only valid if the codes are sorted first
pandas dtypes vs. statistical type
A crucial habit: a pandas dtype is not a statistical type. The
dtype tells you how the bytes are stored (int64, float64, object,
category); the statistical type is a judgment about meaning that
only you can make. They often disagree.
| Statistical type | Typical pandas dtype | But watch out |
|---|---|---|
| Nominal | object, category | …or int64 for ID/zip codes |
| Ordinal | category (ordered) | …often a bare int64 for ratings |
| Discrete numeric | int64 | …genuinely counts, arithmetic OK |
| Continuous numeric | float64 | usually trustworthy |
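Two practical moves. First, convert genuine categoricals to the
category dtype; it saves memory and signals intent. Second, for
ordinal data, make an ordered categorical so pandas knows
S < M < L, which unlocks proper sorting, comparisons, and min()/max().
Calling .median() directly on a categorical column typically raises a
TypeError, so a median category has to be read off the ordered
category codes instead.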
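Here is one way to build an ordered categorical, assuming pandas is installed; the size values are made up. The median category is computed from the integer category codes, which avoids relying on reductions that categorical columns may not support, and the example uses an odd number of rows so the middle value is a single category.

```python
import pandas as pd

# Declare the order explicitly: S < M < L.
size_dtype = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
sizes = pd.Series(["M", "S", "L", "M", "L"]).astype(size_dtype)

# Sorting follows the declared order, not alphabetical order.
sorted_sizes = sizes.sort_values().tolist()

# Order-based operations become available.
smallest, largest = sizes.min(), sizes.max()
below_large = int((sizes < "L").sum())

# Median category via the integer position codes (0 for S, 1 for M, 2 for L).
median_code = int(sizes.cat.codes.median())
median_size = size_dtype.categories[median_code]

print(sorted_sizes)
print(f"min: {smallest}, max: {largest}, below L: {below_large}, median: {median_size}")
```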
Likert scales: the gray zone
Survey scales ("Strongly disagree" … "Strongly agree", coded 1–5) are ordinal. Purists summarize them with the median, mode, and the distribution of responses. In practice, analysts very often average Likert items (and treat sums of many items as roughly interval) — it's a widespread, useful approximation. Just know you're assuming equal spacing when you do, report it honestly, and never extend the habit to nominal codes, where a mean has no meaning at all.
Classify and summarize correctly
A DataFrame df has several columns. Build a dict named types mapping each column name to its statistical type as one of these exact lowercase strings:
- "nominal" — unordered label (including numeric-looking IDs/codes)
- "ordinal" — ordered categories with uneven gaps
- "discrete" — countable whole-number quantity
- "continuous" — measured quantity that can take any value in a range
Classify by meaning, not by dtype. The columns are:
customer_id (an identifier), country (a label), satisfaction (1–5 rating), num_purchases (a count), account_balance (a dollar amount).
types must have exactly those 5 keys.
You're given three columns of different types. Compute the type-appropriate summary for each into a dict named summary:
- "country_mode" — the mode (most common value) of the nominal country column, as a string. Use df["country"].mode()[0].
- "rating_median" — the median of the ordinal rating column, as a float.
- "price_mean" — the mean of the continuous price column, as a float.
The point: mode for nominal, median for ordinal, mean for continuous — not a mean for all three.
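A sketch of the pattern on a stand-in DataFrame, assuming pandas is installed. The exercise's real df is not shown here, so the rows below are hypothetical; only the column names and the summary keys come from the task description.

```python
import pandas as pd

# Hypothetical stand-in for the exercise's df.
df = pd.DataFrame({
    "country": ["US", "DE", "US", "IN", "US"],
    "rating": [5, 3, 4, 1, 4],
    "price": [19.99, 5.00, 12.50, 7.25, 30.00],
})

summary = {
    # Nominal: most common label.
    "country_mode": df["country"].mode()[0],
    # Ordinal: middle rank, which makes no equal-spacing assumption.
    "rating_median": float(df["rating"].median()),
    # Continuous: arithmetic mean is meaningful.
    "price_mean": float(df["price"].mean()),
}

print(summary)
```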
The reflex to build
Before you call .mean() on any column, ask: is this column a
quantity, or a label/rank wearing a number costume? Quantities
(discrete, continuous, ratio/interval) can be averaged. Labels
(nominal) and ranks (ordinal) want counts, modes, and medians instead.
Why type also decides which tests are valid
This page is the gateway to the inference chapters, because the type of your variables decides which statistical test is appropriate — not just which summary. A quick preview of where this leads:
- Comparing a numerical outcome across two groups → a t-test (see T-Tests).
- Comparing a numerical outcome across three or more groups → ANOVA.
- Testing whether two categorical variables are associated → a chi-square test.
- Measuring the relationship between two numerical variables → correlation.
We cover these in the ANOVA and Chi-Square page and the Correlation and Nonparametric page. For now, just register the pattern: getting the variable type right is the first step of choosing a valid test, long before any p-value appears.
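The four pairings in the preview can be read as a lookup from variable types to a test family. The function below is a hypothetical helper for illustration only; it encodes just the cases listed above and ignores the assumptions each test carries.

```python
def suggest_test(outcome: str, predictor: str, n_groups: int = 0) -> str:
    """Map variable types to the test family named in the preview list."""
    if outcome == "numerical" and predictor == "categorical":
        if n_groups == 2:
            return "t-test"
        if n_groups >= 3:
            return "ANOVA"
        raise ValueError("need at least two groups to compare")
    if outcome == "categorical" and predictor == "categorical":
        return "chi-square test"
    if outcome == "numerical" and predictor == "numerical":
        return "correlation"
    raise ValueError("combination not covered by this preview")


print(suggest_test("numerical", "categorical", n_groups=2))
print(suggest_test("numerical", "categorical", n_groups=4))
print(suggest_test("categorical", "categorical"))
print(suggest_test("numerical", "numerical"))
```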
Check your understanding
Which variable is ordinal (ordered categories with possibly uneven gaps), rather than nominal or numerical?
A user's country of residence
T-shirt size recorded as S, M, L, XL
The exact price paid in dollars
A randomly assigned account number
Why is "20 °C is twice as warm as 10 °C" an incorrect statement, statistically speaking?
Because temperature is nominal and can't be compared at all
Because the values should be averaged first
Because Celsius is an interval scale with an arbitrary zero, so ratios like "twice as warm" aren't meaningful
Because 10 and 20 are too close together to compare
A dataset codes survey answers as 1=Disagree, 2=Neutral, 3=Agree. Someone reports the mean of this column as the headline summary. What's the most accurate critique?
The mean is perfectly valid because the codes are numbers
The column should have been continuous
The codes are ordinal, so the gaps between them may not be equal; the median or the distribution of responses is a more defensible summary than the mean
Nothing is wrong as long as the sample is large
A column store_id has dtype int64 in pandas. What does that tell you about its statistical type?
It confirms the column is numerical and can be averaged
It means the column is ordinal
Almost nothing — dtype is about storage, and an integer column can be a nominal ID where arithmetic is meaningless
It guarantees there are no missing values
Key takeaways
- Every variable is categorical (nominal/ordinal) or numerical (discrete/continuous); the type governs valid summaries, charts, and tests.
- The four measurement scales — nominal, ordinal, interval, ratio — each unlock more operations; interval lacks a true zero (no ratios), ratio has one.
- Don't average labels or ranks: nominal → counts/mode, ordinal → median/mode/distribution, numerical → mean (or median if skewed).
- pandas dtype is not statistical type: numeric-looking IDs, zip codes, and category codes are nominal even when stored as int64.
- Getting the type right is the first step in choosing a valid statistical test.
With variable types straight, you're ready to summarize them honestly. Measures of Center picks the right "typical value" for each type, Visualizing Distributions matches charts to types, and ANOVA and Chi-Square shows how categorical variables get tested for real.