Dataslope logoDataslope

Subqueries in Analytics

Scalar, IN, and correlated subqueries — where each fits in analytical work, how they compare to CTEs and joins, and how to choose the clearest tool for the question.

A subquery is a query nested inside another. You met them implicitly in the last page (and saw why deep nesting hurts readability). Used in moderation, though, subqueries are a precise, expressive tool. This page sorts the common kinds, shows where each shines in analysis, and gives you a rule for choosing between a subquery, a CTE, and a join.

Three kinds of subquery

They differ by what they return and how they connect to the outer query. Let us take them in turn.

Scalar subquery: a single value

A scalar subquery returns exactly one row, one column — a single number you can drop into a SELECT or WHERE. It is perfect for comparing each row to a global summary, like "how does this order compare to the overall average?"

SQL
DuckDB 1.32.0

The inner (SELECT AVG(amount) FROM orders) computes one number, reused on every row. (A window AVG(amount) OVER () would also work — often the choice is taste plus clarity.)

IN subquery: a filter list

An IN subquery returns a column of values to filter against. It answers "keep rows whose key appears in this other set" — e.g. "orders from customers in the loyalty program."

SQL
DuckDB 1.32.0

The subquery produces the gold customers' ids; the outer query keeps only matching orders. Read it as "orders whose customer is in the gold set."

Correlated subquery: one per outer row

A correlated subquery references a column from the outer query, so it is conceptually re-evaluated for each outer row. It is expressive — "the average amount for this row's region" — but can be slower, and a window function or join is often clearer.

SQL
DuckDB 1.32.0

Notice WHERE x.region = o.region — the inner query depends on the outer row's region. This works, but AVG(amount) OVER (PARTITION BY region) expresses the same idea more directly and usually runs better. Knowing when not to reach for a correlated subquery is part of the skill.

EXISTS: does any matching row exist?

EXISTS is a correlated subquery that returns true/false — "is there at least one matching row?" It is the natural way to ask "customers who have placed any order" without caring how many.

SQL
DuckDB 1.32.0

Bo has no orders, so Bo is excluded. EXISTS stops at the first match, making it an efficient "has any" test.

Choosing: subquery vs. CTE vs. join

These tools overlap. A rough guide for analytical work:

You want...Reach for
A single value to compare againstScalar subquery or OVER ()
To keep rows matching a set of keysIN subquery or a join
A per-row value from related rowsWindow function (clearest), else correlated subquery
A multi-step pipeline you will read laterCTEs
To combine columns from two tablesJoin

The meta-rule: prefer the clearest tool, not the cleverest. Reach for a CTE when steps pile up; a window function when you need per-row context; a join when you are combining tables; a small subquery when it genuinely reads better.

Check your understanding

QuestionSelect one

What does a scalar subquery return?

A whole table of rows and columns.

Exactly one row and one column — a single value usable in SELECT or WHERE.

A list of values for an IN filter.

Nothing; it only filters.

QuestionSelect one

What makes a subquery correlated?

It returns more than one column.

It references a column from the outer query, so it conceptually re-runs for each outer row.

It is written using a WITH clause.

It always uses EXISTS.

QuestionSelect one

You want each order shown alongside its own region's average amount. Which is usually the clearest tool?

A correlated subquery re-computing the average per row.

A window function: AVG(amount) OVER (PARTITION BY region).

A GROUP BY region query.

An IN subquery.

You can now compose analytical queries from clear building blocks. Time to put everything together toward the real goal — turning data into business insight.

On this page