SQL for Data Analysis with DuckDB

Analytics vs. Application Development Analytical vs. Transactional Workloads Thinking in Datasets Why DuckDB?

Your First Analytical Query Loading and Generating Data

Profiling a Dataset Filtering and Slicing Sorting and Top-N

Aggregate Functions Grouping Data GROUP BY ALL and Grouping Sets Filtering Groups with HAVING

Why Window Functions Exist Ranking and ROW_NUMBER Running Totals and Moving Averages

Conditional Logic Working with Dates and Time Pivoting and Reshaping

Common Table Expressions Subqueries in Analytics

Building Business Metrics Reproducible Analytical Workflows

GROUP BY ALL and Grouping Sets

DuckDB conveniences that make aggregation faster to write and more powerful — GROUP BY ALL, plus ROLLUP, CUBE, and GROUPING SETS for subtotals and multi-level summaries.

The last page covered GROUP BY as you would write it in any database. DuckDB adds a few conveniences that remove ceremony and unlock multi-level summaries in a single query. This page covers GROUP BY ALL (a daily time-saver) and grouping sets (ROLLUP, CUBE, GROUPING SETS) for subtotals — the kind of thing you would otherwise stitch together with several queries.

`GROUP BY ALL` — stop repeating yourself

A normal grouped query forces you to list the grouping columns twice: once in SELECT, once in GROUP BY. On a wide query that is tedious and error-prone — add a dimension to SELECT and forget to add it to GROUP BY, and you get an error (or worse, a wrong result in databases that allow it).

GROUP BY ALL tells DuckDB: group by every selected column that is not inside an aggregate. You write the dimensions once.

SQL

DuckDB 1.32.0

The result is identical to writing GROUP BY region, product, channel, but the intent is clearer and the query is harder to get wrong. This is one of the small ergonomics that make DuckDB pleasant for fast analysis.

GROUP BY ALL is the non-aggregated columns

DuckDB inspects your SELECT list and groups by exactly the columns that are not wrapped in an aggregate. Aggregated columns (SUM, COUNT, etc.) are summarized; everything else becomes a grouping key.

The subtotal problem

Suppose you have revenue per (region, product) and you also want the revenue per region overall, and the grand total — all in one result. Plain GROUP BY gives only the finest level. Running three separate queries works but is clumsy and hard to combine. Grouping sets solve this in one pass.

`ROLLUP` — hierarchical subtotals

ROLLUP(region, product) produces the detailed rows plus a subtotal per region plus a grand total. It is perfect for hierarchies where each level rolls up into the one above (product → region → everything).

SQL

DuckDB 1.32.0

Notice the rows where product is NULL: those are the region subtotals. The row where both are NULL is the grand total. The NULLs are SQL's way of saying "this column is not part of this subtotal level."

`CUBE` — every combination of subtotals

Where ROLLUP follows one hierarchy, CUBE(region, product) produces subtotals for every combination: per region, per product, and the grand total. Use it when no single hierarchy dominates and you want all the marginal totals.

SQL

DuckDB 1.32.0

You now have per-region subtotals, per-product subtotals, the detail, and the grand total — the full cross-tabulation, in one query.

Distinguishing real NULLs from subtotal NULLs

There is a subtlety: a subtotal row uses NULL to mark "all values," but your data might also contain genuine NULLs. The GROUPING() function returns 1 for a "subtotal" NULL and 0 for a real value, letting you label rows unambiguously.

SQL

DuckDB 1.32.0

Check your understanding

QuestionSelect one

What does GROUP BY ALL do in DuckDB?

Groups by literally every column in the table, even ones not selected.

Groups by every column in the SELECT list that is not inside an aggregate function.

Removes grouping and returns one grand total.

Is a synonym for SELECT DISTINCT.

QuestionSelect one

In a GROUP BY ROLLUP (region, product) result, a row has region = 'West' and product = NULL. What is it?

A data error where the product was lost.

The subtotal of revenue for all products in the West region.

The grand total across all regions.

A duplicate of a detail row.

QuestionSelect one

When would you choose CUBE(region, product) over ROLLUP(region, product)?

When you only want the finest-grained detail rows.

When you want subtotals for every combination — per region, per product, and the grand total — not just one hierarchy.

When you want fewer rows than ROLLUP.

When the columns contain NULLs.

You can now summarize at multiple levels in a single query. But what if you want to filter based on a group's summary — "only regions with revenue over $1000"? That needs HAVING, which is next.

Grouping Data

GROUP BY turns one grand total into one summary per category — the single most-used analytical construct. How grouping works mentally, and how to read a grouped result.

Filtering Groups with HAVING

WHERE filters rows before grouping; HAVING filters groups after aggregating. Why analysts need both, and how to keep them straight.

On this page

GROUP BY ALL — stop repeating yourself The subtotal problem ROLLUP — hierarchical subtotals CUBE — every combination of subtotals Distinguishing real NULLs from subtotal NULLs Check your understanding