SQL for Data Analysis with DuckDB

Analytics vs. Application Development Analytical vs. Transactional Workloads Thinking in Datasets Why DuckDB?

Your First Analytical Query Loading and Generating Data

Profiling a Dataset Filtering and Slicing Sorting and Top-N

Aggregate Functions Grouping Data GROUP BY ALL and Grouping Sets Filtering Groups with HAVING

Why Window Functions Exist Ranking and ROW_NUMBER Running Totals and Moving Averages

Conditional Logic Working with Dates and Time Pivoting and Reshaping

Common Table Expressions Subqueries in Analytics

Building Business Metrics Reproducible Analytical Workflows

Reproducible Analytical Workflows

Why an analysis must give the same answer twice, and how to build queries that do — deterministic ordering, idempotent setup, views, and SQL as the documentation of your logic.

A number that cannot be reproduced cannot be trusted. If you run the same analysis tomorrow — or a colleague runs it on their machine — it should produce the same result, and you should be able to explain exactly how it was computed. This closing-technique page is about reproducibility: the habits that turn a one-off query into analysis others can rely on.

Why reproducibility is non-negotiable

Analysis drives decisions, and decisions get questioned. "Where did this $2M figure come from?" is a question you must be able to answer months later. Reproducible work means:

The same inputs always give the same outputs.
Anyone can re-run the analysis and get your number.
The logic is readable, so the why is recoverable, not just the what.

SQL is unusually good at this: a query is simultaneously the computation and its own documentation. The trick is writing it so that property actually holds.

Determinism: make ordering and ties stable

The quietest reproducibility bug is non-deterministic ordering. If two rows tie on the sort key, SQL may return them in either order — so a "top 10" can change between runs even on identical data. Always add a tiebreaker to make the order total and stable.

SQL

DuckDB 1.32.0

Without , id ASC, the three products tied at 100 could appear in any order, and LIMIT 3 might return a different set each run. The tiebreaker makes "top 3" a single, defensible answer.

Idempotent setup: safe to re-run

Reproducible analyses often (re)build their inputs. Setup steps should be idempotent — running them twice does no harm. DuckDB's CREATE OR REPLACE and CREATE ... IF NOT EXISTS make scripts safe to re-run without "already exists" errors.

SQL

DuckDB 1.32.0

Because the table is replaced, re-running the analysis from scratch always yields the same clean_orders — no leftover state, no surprises.

Views: name a query so it stays consistent

A view is a saved query that behaves like a table. Define a metric once as a view, and everyone who reads from it gets the same definition — no copy-pasted SQL drifting out of sync. Views are how teams keep a metric's meaning consistent.

SQL

DuckDB 1.32.0

The definition of "paid revenue by region" now lives in one place. If the business redefines it, you change the view once and every downstream query follows.

Structure for readability and auditability

Reproducibility is as much about humans re-running the logic as machines. The composition habits from earlier pay off here:

Build with CTEs. A named pipeline documents each step, so a reviewer can audit the logic stage by stage.
Comment the intent, not the syntax. Say why a filter exists, not what WHERE does.
Name things honestly. paying_users, not cnt2. The query should read like a description of the analysis.

Put together, these habits mean your analysis is not a lucky result you got once — it is a repeatable, explainable process.

Check your understanding

QuestionSelect one

Why add a tiebreaker like ORDER BY sales DESC, id ASC to a top-N query?

It makes the query run faster.

Without it, rows tied on sales can appear in any order, so the top-N (and thus the result) may change between runs.

It removes duplicate rows.

It converts sales to a percentage.

QuestionSelect one

What does it mean for a setup step to be idempotent, and how does CREATE OR REPLACE TABLE help?

It means the step runs only once and then locks the table.

Running it multiple times has the same effect as running it once; CREATE OR REPLACE rebuilds the table cleanly without "already exists" errors.

It means the table can never be changed again.

It deletes the database after running.

QuestionSelect one

How does defining a metric as a view support reproducibility across a team?

It makes the metric impossible to change.

Everyone queries the same named definition, so the metric's logic lives in one place instead of being copy-pasted and drifting out of sync.

It caches the result so it never recomputes.

It hides the SQL from analysts.

That completes the analytical toolkit — from mindset to metrics to reproducible workflows. Let us wrap up and look at where to go next.

Building Business Metrics

Turning raw rows into the numbers a business actually decides on — conversion rates, retention, revenue per user — and the analytical reasoning that keeps a metric honest.

Next Steps

A recap of the analytical mindset and toolkit you built, one last query that ties it together, and where to go next on your data analysis journey.

On this page

Why reproducibility is non-negotiable Determinism: make ordering and ties stable Idempotent setup: safe to re-run Views: name a query so it stays consistent Structure for readability and auditability Check your understanding