Types of Data

Why a variable's type decides which summaries, charts, and tests are valid — categorical vs numerical, the four measurement scales, and the encoding traps that make people average things that can't be averaged.

Before you compute a single statistic, you have to answer a question that sounds trivial but quietly governs everything: what kind of data is this? The type of a variable decides which summaries make sense, which charts are honest, and which tests are even legal. Take the mean of the wrong kind of column and you'll get a number — a clean, confident, completely meaningless number.

The classic example: a survey stores favorite-flavor responses as 1 = vanilla, 2 = chocolate, 3 = strawberry. Average them and you might get 1.8. There is no flavor 1.8. The arithmetic ran fine; the meaning is nonsense, because those numbers are labels, not quantities. This page is about telling those apart — by eye and in pandas — so you never ship a 1.8-flavored conclusion.
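
A minimal sketch of the flavor trap, using ten made-up responses (the data and the variable names `flavor_codes` and `labels` are assumptions for illustration). The mean comes out as 1.8, while the counts tell you what people actually picked:

```python
import pandas as pd

# Hypothetical survey responses: 1 = vanilla, 2 = chocolate, 3 = strawberry
flavor_codes = pd.Series([1, 1, 1, 1, 1, 2, 2, 3, 3, 3], name="favorite_flavor")

# pandas computes this without complaint, but 1.8 is not a flavor
print(flavor_codes.mean())

# Map the codes back to labels and count: the honest summary for nominal data
labels = flavor_codes.map({1: "vanilla", 2: "chocolate", 3: "strawberry"})
counts = labels.value_counts()
print(counts)
print("Most common flavor:", labels.mode()[0])
```

Here the mean sits just below "chocolate," yet vanilla is the most common answer and chocolate the least common one.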

Why this page comes before the statistics

Every later technique assumes you've classified your variables correctly. Measures of Center asks "mean or mode?" — the answer is a type question. Visualizing Distributions asks "histogram or bar chart?" — also a type question. Pick the right type once and a dozen downstream choices become obvious.

The big split: categorical vs. numerical

At the top level, every variable is either categorical (it labels which group something belongs to) or numerical (it measures how much or how many). Each splits once more:

Nominal — pure labels with no inherent order: country, payment method, color, user_id. You can count how many fall in each category, but "greater than" is meaningless.
Ordinal — categories with a meaningful order but unknown, uneven gaps: shirt sizes S < M < L, a 1–5 satisfaction rating, education level. You can rank them, but the distance between "satisfied" and "very satisfied" isn't guaranteed equal to the distance between "neutral" and "satisfied."
Discrete numerical — counts you get by tallying whole things: number of logins, defects per batch, children per household. Usually non-negative integers.
Continuous numerical — measurements that can take any value in a range (limited only by instrument precision): height, temperature, price, response time.

A fast field test

Ask two questions. (1) Can you order the values? No → nominal. (2) If ordered, are the gaps equal and is arithmetic meaningful? No → ordinal. Yes → numerical (then ask if it's a count → discrete, or a measurement → continuous). Most classification mistakes come from skipping question 2 and treating ordered labels as real numbers.
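
The two questions can be written down as a tiny decision function. This is only a sketch: the function name `classify` and its boolean arguments are hypothetical, and the answers you pass in are still your own judgment about what the column means.

```python
def classify(can_order: bool, equal_gaps: bool = False, is_count: bool = False) -> str:
    """Return a statistical type from the answers to the field-test questions."""
    if not can_order:
        return "nominal"      # question 1 failed: pure labels
    if not equal_gaps:
        return "ordinal"      # ordered, but spacing is unknown
    # Ordered with equal gaps: numerical, so ask count vs. measurement
    return "discrete" if is_count else "continuous"


print(classify(can_order=False))                                  # payment method
print(classify(can_order=True, equal_gaps=False))                 # shirt size
print(classify(can_order=True, equal_gaps=True, is_count=True))   # number of logins
print(classify(can_order=True, equal_gaps=True, is_count=False))  # response time
```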

The four measurement scales (the intuition)

A slightly finer lens, due to S.S. Stevens, splits data into four measurement scales. You don't need the theory — you need the practical rule each scale implies: which operations are meaningful.

Scale	What it adds	Example	Mean OK?	"True zero"?
Nominal	Names only	eye color, country	No	No
Ordinal	Order	S/M/L, Likert 1–5	No (use median/mode)	No
Interval	Equal gaps	temperature in °C, calendar year	Yes (differences)	No
Ratio	True zero	height, price, count, duration	Yes (and ratios)	Yes

The ladder is cumulative: each scale keeps everything the scales before it allow (the rows above it in the table) and adds one more property. The two boundaries that trip people up:

Ordinal → interval. Ordinal tells you the order but not the spacing. Averaging it pretends the gaps are equal when they aren't.
Interval → ratio. Interval has equal gaps but no true zero, so ratios are meaningless. 20 °C is not "twice as hot" as 10 °C (the zero is arbitrary). Ratio scales have a real zero, so "twice as long" or "half the price" are valid.
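
A quick arithmetic check of the interval-versus-ratio point, written as a sketch in plain Python (the helper names `c_to_k` and `c_to_f` are ours). If "twice as hot" were a real property of the two temperatures, the ratio would survive a change of units. It doesn't, while a ratio of durations does:

```python
def c_to_k(celsius: float) -> float:
    """Convert degrees Celsius to kelvin."""
    return celsius + 273.15


def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval scale: the "ratio" depends on where the unit puts its zero
ratio_c = 20 / 10                  # 2.0 in Celsius
ratio_f = c_to_f(20) / c_to_f(10)  # 68 / 50 = 1.36 in Fahrenheit
ratio_k = c_to_k(20) / c_to_k(10)  # about 1.035 in kelvin
print(ratio_c, ratio_f, round(ratio_k, 3))

# Ratio scale: durations have a true zero, so the ratio is unit-independent
ratio_seconds = 20 / 10
ratio_millis = 20_000 / 10_000
print(ratio_seconds, ratio_millis)
```

Differences behave better: a 10-degree gap in Celsius is the same 10-unit gap in kelvin, which is why means and differences are fine on an interval scale.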

Common misconception: treating ordinal as interval

The single most common scale error in data work is averaging an ordinal variable. A 1–5 star rating looks numeric, so people report "average rating 4.2." But the gap from 1 to 2 may not equal the gap from 4 to 5, so the mean is built on an assumption you can't justify. The honest summaries are the median, the mode, and the full distribution of counts (how many 1s, 2s, …, 5s). Means of Likert items are common in practice and sometimes defensible, but treat them as a convenient approximation, not a rigorous summary — and never for nominal codes.
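
To make that concrete, here is a sketch with ten made-up star ratings (the data is hypothetical and deliberately polarized). The mean suggests a lukewarm product; the median and the counts show that most people loved it and a few hated it:

```python
import pandas as pd

# Hypothetical 1-5 star ratings: six 5s, one 4, three 1s
ratings = pd.Series([5, 5, 5, 5, 5, 5, 4, 1, 1, 1], name="stars")

mean_rating = ratings.mean()      # assumes equal gaps between stars
median_rating = ratings.median()  # only assumes the order
distribution = ratings.value_counts().sort_index()

print("mean:", mean_rating)
print("median:", median_rating)
print(distribution)
```

Nobody in this sample gave a rating near the mean, which is exactly the kind of detail the full distribution preserves.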

Choosing summaries by type

The payoff of all this taxonomy is a simple lookup: once you know the type, the right summary and chart almost pick themselves.

Categorical (nominal or ordinal): count categories with value_counts(); report the mode (most frequent) and proportions; visualize with a bar chart. For ordinal, the median category is also meaningful.
Numerical: report center (mean if roughly symmetric, median if skewed) and spread (standard deviation or IQR); visualize with a histogram or box plot. We dig into these choices in Measures of Center and Visualizing Distributions.
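
The lookup can be sketched on a small made-up DataFrame. The column names (`user_id`, `plan`, `logins`, `response_ms`) and all values are assumptions for illustration, not data from a real system:

```python
import pandas as pd

# Hypothetical usage data
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105],                  # identifier
    "plan": ["free", "pro", "free", "free", "pro"],        # nominal label
    "logins": [3, 12, 0, 7, 3],                            # discrete count
    "response_ms": [120.5, 98.2, 143.0, 110.7, 101.1],     # continuous measurement
})

print(df.dtypes)

# Categorical: counts, proportions, and the mode
plan_counts = df["plan"].value_counts()
plan_share = df["plan"].value_counts(normalize=True)
plan_mode = df["plan"].mode()[0]

# Numerical: center and spread
logins_median = df["logins"].median()
response_mean = df["response_ms"].mean()
response_std = df["response_ms"].std()

print(plan_counts, plan_share, plan_mode, sep="\n")
print(logins_median, response_mean, response_std)
```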

Notice that a column such as user_id is usually stored as an integer, so pandas reports its dtype as int64, yet it is nominal: the IDs are names that happen to look like numbers. The next section is about exactly that trap.

Encoding traps: when numbers aren't quantities

The most dangerous columns are the ones that look numerical but aren't. pandas will happily compute df["zip_code"].mean() and hand you a number. The dtype is int64; the meaning is gibberish.
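
A sketch of the trap on a four-row frame. The rows are made up for illustration (the zip codes were picked so their mean works out to 48,705), and the column names `rating` and `price_usd` are assumptions:

```python
import pandas as pd

# Hypothetical orders: a label, a rank, and a true quantity, all stored as numbers
df = pd.DataFrame({
    "zip_code": [10001, 60601, 94103, 30115],   # nominal codes
    "rating": [5, 4, 2, 5],                     # ordinal 1-5 stars
    "price_usd": [19.99, 24.50, 9.99, 14.00],   # ratio-scale quantity
})

print(df.dtypes)   # every column looks numeric to pandas
print(df.mean())   # so every column gets a mean, meaningful or not
```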

Average four zip codes and you might get something like 48,705. That isn't a place; it's the centroid of four arbitrary labels. The mean of a 1–5 rating column is less wrong but still rests on assuming equal spacing. Only a column like price_usd, a true ratio quantity, earns its mean.

The numeric-looking ID trap

If a column's numbers are identifiers or codes — user_id, zip_code, store_number, product_sku, encoded categories like 1=vanilla — then arithmetic on them is meaningless no matter what the dtype says. A reliable tell: would value + value or value / 2 mean anything? For a zip code or a flavor code, no. Cast these to category (or str) so you don't accidentally average a label. The computer can't tell a quantity from a code — that judgment is yours.
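
One way to apply that cast, sketched on made-up data (the frame and its values are hypothetical). Once the column is a category, it drops out of numeric selections, and in the pandas versions we're aware of a direct `.mean()` raises a `TypeError` instead of returning a number:

```python
import pandas as pd

# Hypothetical frame with a numeric-looking code and a real quantity
df = pd.DataFrame({
    "zip_code": [10001, 60601, 94103, 30115],
    "price_usd": [19.99, 24.50, 9.99, 14.00],
})

# Declare the code column as a label
df["zip_code"] = df["zip_code"].astype("category")

# Numeric selections now skip it
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Averaging the label is refused rather than silently computed
try:
    df["zip_code"].mean()
    mean_refused = False
except TypeError as exc:
    mean_refused = True
    print("mean refused:", exc)

# Counting still works, and is the summary that makes sense
print(df["zip_code"].value_counts())
```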

Question (select one)

A postal_code column is stored as integers. A teammate computes its mean to "find the typical region." What's the problem?

The mean is fine; it gives the geographic center

Postal codes should be floats, not integers

Postal code is nominal (a label that happens to look numeric), so averaging it is meaningless; count or mode is appropriate

The mean is only valid if the codes are sorted first

pandas dtypes vs. statistical type

A crucial habit: a pandas dtype is not a statistical type. The dtype tells you how the bytes are stored (int64, float64, object, category); the statistical type is a judgment about meaning that only you can make. They often disagree.

Statistical type	Typical pandas dtype	But watch out
Nominal	`object`, `category`	…or `int64` for ID/zip codes
Ordinal	`category` (ordered)	…often a bare `int64` for ratings
Discrete numeric	`int64`	…genuinely counts, arithmetic OK
Continuous numeric	`float64`	usually trustworthy

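Two practical moves. First, convert genuine categoricals to the category dtype, which usually saves memory and signals intent. Second, for ordinal data, make an ordered categorical so pandas knows S < M < L, which unlocks comparisons, min() and max(), and proper sorting. Note that pandas does not compute .median() on a categorical column directly; one workaround is to take the median of the integer codes (.cat.codes) and map it back to a category.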
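
Both moves, sketched on made-up data (the series `country` and `sizes` are hypothetical). The median step goes through the integer codes because, as far as we know, pandas refuses to compute a median on a categorical column directly; with an odd number of rows the median code is a whole number that maps cleanly back to a category:

```python
import pandas as pd

# Move 1: a nominal column becomes an unordered category
country = pd.Series(["DE", "US", "US", "FR", "US"]).astype("category")
print(country.dtype, list(country.cat.categories))

# Move 2: an ordinal column becomes an ordered category with an explicit order
size_dtype = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
sizes = pd.Series(["M", "S", "L", "M", "S"]).astype(size_dtype)

sorted_sizes = sizes.sort_values().tolist()   # sorts S, M, L, not alphabetically
smallest, largest = sizes.min(), sizes.max()
bigger_than_s = int((sizes > "S").sum())      # order-aware comparison

# Median via the integer codes (S=0, M=1, L=2), mapped back to a label
median_code = int(sizes.cat.codes.median())
median_size = sizes.cat.categories[median_code]

print(sorted_sizes, smallest, largest, bigger_than_s, median_size)
```

With an even number of rows the median code can fall halfway between two categories, so you would need to decide how to report it (for example, by naming both middle categories).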

Likert scales: the gray zone

Survey scales ("Strongly disagree" … "Strongly agree", coded 1–5) are ordinal. Purists summarize them with the median, mode, and the distribution of responses. In practice, analysts very often average Likert items (and treat sums of many items as roughly interval) — it's a widespread, useful approximation. Just know you're assuming equal spacing when you do, report it honestly, and never extend the habit to nominal codes, where a mean has no meaning at all.

Classify and summarize correctly

A DataFrame df has several columns. Build a dict named types mapping each column name to its statistical type as one of these exact lowercase strings:

"nominal" — unordered label (including numeric-looking IDs/codes)
"ordinal" — ordered categories with uneven gaps
"discrete" — countable whole-number quantity
"continuous" — measured quantity that can take any value in a range

Classify by meaning, not by dtype. The columns are:

customer_id (an identifier), country (a label), satisfaction (1–5 rating), num_purchases (a count), account_balance (a dollar amount).

types must have exactly those 5 keys.

You're given three columns of different types. Compute the type-appropriate summary for each into a dict named summary:

"country_mode" — the mode (most common value) of the nominal country column, as a string. Use df["country"].mode()[0].
"rating_median" — the median of the ordinal rating column, as a float.
"price_mean" — the mean of the continuous price column, as a float.

The point: mode for nominal, median for ordinal, mean for continuous — not a mean for all three.

The reflex to build

Before you call .mean() on any column, ask: is this column a quantity, or a label/rank wearing a number costume? Quantities (discrete, continuous, ratio/interval) can be averaged. Labels (nominal) and ranks (ordinal) want counts, modes, and medians instead.

Why type also decides which tests are valid

This page is the gateway to the inference chapters, because the type of your variables decides which statistical test is appropriate — not just which summary. A quick preview of where this leads:

Comparing a numerical outcome across two groups → a t-test (see T-Tests).
Comparing a numerical outcome across three or more groups → ANOVA.
Testing whether two categorical variables are associated → a chi-square test.
Measuring the relationship between two numerical variables → correlation.

We cover these in two later pages: ANOVA and Chi-Square, and Correlation and Nonparametric. For now, just register the pattern: getting the variable type right is the first step of choosing a valid test, long before any p-value appears.

Check your understanding

Question (select one)

Which variable is ordinal (ordered categories with possibly uneven gaps), rather than nominal or numerical?

A user's country of residence

T-shirt size recorded as S, M, L, XL

The exact price paid in dollars

A randomly assigned account number

Question (select one)

Why is "20 °C is twice as warm as 10 °C" an incorrect statement, statistically speaking?

Because temperature is nominal and can't be compared at all

Because the values should be averaged first

Because Celsius is an interval scale with an arbitrary zero, so ratios like "twice as warm" aren't meaningful

Because 10 and 20 are too close together to compare

Question (select one)

A dataset codes survey answers as 1=Disagree, 2=Neutral, 3=Agree. Someone reports the mean of this column as the headline summary. What's the most accurate critique?

The mean is perfectly valid because the codes are numbers

The column should have been continuous

The codes are ordinal, so the gaps between them may not be equal; the median or the distribution of responses is a more defensible summary than the mean

Nothing is wrong as long as the sample is large

Question (select one)

A column store_id has dtype int64 in pandas. What does that tell you about its statistical type?

It confirms the column is numerical and can be averaged

It means the column is ordinal

Almost nothing — dtype is about storage, and an integer column can be a nominal ID where arithmetic is meaningless

It guarantees there are no missing values

Key takeaways

Every variable is categorical (nominal/ordinal) or numerical (discrete/continuous); the type governs valid summaries, charts, and tests.
The four measurement scales — nominal, ordinal, interval, ratio — each unlock more operations; interval lacks a true zero (no ratios), ratio has one.
Don't average labels or ranks: nominal → counts/mode, ordinal → median/mode/distribution, numerical → mean (or median if skewed).
pandas dtype is not statistical type: numeric-looking IDs, zip codes, and category codes are nominal even when stored as int64.
Getting the type right is the first step in choosing a valid statistical test.

With variable types straight, you're ready to summarize them honestly. Measures of Center picks the right "typical value" for each type, Visualizing Distributions matches charts to types, and ANOVA and Chi-Square shows how categorical variables get tested for real.
