Grouping Data
GROUP BY turns one grand total into one summary per category — the single most-used analytical construct. How grouping works mentally, and how to read a grouped result.
A grand total answers "how much, overall?" But analysts almost always
want more: "how much per region?", "average rating per product?",
"orders per day?". That per-something breakdown is GROUP BY, and
it is arguably the most-used construct in all of analytical SQL. This page
builds a solid mental model of what grouping really does.
From one summary to many
Without GROUP BY, an aggregate collapses the whole table to one row.
GROUP BY first sorts the rows into bins by some key, then computes
the aggregate once per bin. One total becomes one total per group.
The result has one row per group, with the grouping column(s) plus whatever you aggregated.
Read the result as a small report: each line is one region, summarized.
The golden rule of GROUP BY
Every column in the SELECT list must be either in the GROUP BY or
inside an aggregate function. This rule trips up everyone once, so it
is worth understanding why, not just memorizing it.
When you group by region, each result row stands for a whole group of
orders. Asking for a bare id makes no sense — which order's id, when
the row represents twenty orders? There is no single answer, so SQL
refuses. The column must either define the group (region) or be
summarized across it (SUM(amount)).
Grouping by multiple columns
Grouping by two columns creates one row per combination that actually
appears. Above, you get one row per (region, product) pair — a small
matrix of the business. Adding a grouping column is exactly the
drill-down move from earlier: more dimensions, finer detail.
Grouping by an expression
You can group by something computed, not just a stored column — by month extracted from a date, by a price bucket, by the first letter of a name. This is where grouping becomes genuinely creative.
You invented a dimension (size_bucket) that does not exist in the
table, then summarized by it. That is a remarkably powerful idea: grouping
keys can be anything you can compute.
How a grouped query is processed
It helps to picture the order of operations: filter, then group, then aggregate, then sort.
WHERE filters individual rows before grouping — it cannot see group
totals. Filtering on a group total needs HAVING, which is two pages
away.
Check your understanding
What does GROUP BY region do to a table of orders?
It deletes all but one row per region.
It forms one bin per region and computes the aggregates once per bin, yielding one result row per region.
It sorts the orders by region without summarizing.
It returns every order with the region repeated.
In SELECT region, id, SUM(amount) FROM orders GROUP BY region, why is id a problem?
id is a reserved word.
Each result row represents a whole group of orders, so a single id is ambiguous — it is neither grouped nor aggregated.
SUM(amount) is invalid when other columns are present.
You can never select more than two columns with GROUP BY.
What does grouping by region, product (two columns) produce?
One row per region only, ignoring product.
One row per distinct (region, product) combination that appears in the data.
One row per product only, ignoring region.
A syntax error, because GROUP BY allows only one column.
GROUP BY is so central that DuckDB adds conveniences to make it terser
and more powerful — GROUP BY ALL and grouping sets. That is the next
page.
Aggregate Functions
The heart of analytics — collapsing many rows into a single summary number. COUNT, SUM, AVG, MIN, MAX, and why aggregation is how analysts make big data comprehensible.
GROUP BY ALL and Grouping Sets
DuckDB conveniences that make aggregation faster to write and more powerful — GROUP BY ALL, plus ROLLUP, CUBE, and GROUPING SETS for subtotals and multi-level summaries.