Dataslope logoDataslope

Grouping Data

GROUP BY turns one grand total into one summary per category — the single most-used analytical construct. How grouping works mentally, and how to read a grouped result.

A grand total answers "how much, overall?" But analysts almost always want more: "how much per region?", "average rating per product?", "orders per day?". That per-something breakdown is GROUP BY, and it is arguably the most-used construct in all of analytical SQL. This page builds a solid mental model of what grouping really does.

From one summary to many

Without GROUP BY, an aggregate collapses the whole table to one row. GROUP BY first sorts the rows into bins by some key, then computes the aggregate once per bin. One total becomes one total per group.

The result has one row per group, with the grouping column(s) plus whatever you aggregated.

SQL
DuckDB 1.32.0

Read the result as a small report: each line is one region, summarized.

The golden rule of GROUP BY

Every column in the SELECT list must be either in the GROUP BY or inside an aggregate function. This rule trips up everyone once, so it is worth understanding why, not just memorizing it.

When you group by region, each result row stands for a whole group of orders. Asking for a bare id makes no sense — which order's id, when the row represents twenty orders? There is no single answer, so SQL refuses. The column must either define the group (region) or be summarized across it (SUM(amount)).

SQL
DuckDB 1.32.0

Grouping by multiple columns

Grouping by two columns creates one row per combination that actually appears. Above, you get one row per (region, product) pair — a small matrix of the business. Adding a grouping column is exactly the drill-down move from earlier: more dimensions, finer detail.

Grouping by an expression

You can group by something computed, not just a stored column — by month extracted from a date, by a price bucket, by the first letter of a name. This is where grouping becomes genuinely creative.

SQL
DuckDB 1.32.0

You invented a dimension (size_bucket) that does not exist in the table, then summarized by it. That is a remarkably powerful idea: grouping keys can be anything you can compute.

How a grouped query is processed

It helps to picture the order of operations: filter, then group, then aggregate, then sort.

WHERE filters individual rows before grouping — it cannot see group totals. Filtering on a group total needs HAVING, which is two pages away.

Check your understanding

QuestionSelect one

What does GROUP BY region do to a table of orders?

It deletes all but one row per region.

It forms one bin per region and computes the aggregates once per bin, yielding one result row per region.

It sorts the orders by region without summarizing.

It returns every order with the region repeated.

QuestionSelect one

In SELECT region, id, SUM(amount) FROM orders GROUP BY region, why is id a problem?

id is a reserved word.

Each result row represents a whole group of orders, so a single id is ambiguous — it is neither grouped nor aggregated.

SUM(amount) is invalid when other columns are present.

You can never select more than two columns with GROUP BY.

QuestionSelect one

What does grouping by region, product (two columns) produce?

One row per region only, ignoring product.

One row per distinct (region, product) combination that appears in the data.

One row per product only, ignoring region.

A syntax error, because GROUP BY allows only one column.

GROUP BY is so central that DuckDB adds conveniences to make it terser and more powerful — GROUP BY ALL and grouping sets. That is the next page.

On this page