Filtering and Slicing
Using WHERE, BETWEEN, IN, and pattern matching to carve a dataset into the exact slice a question needs — and why filtering early is the analyst's habit.
Profiling tells you what the whole dataset looks like. Most questions,
though, are about a part of it: this region, this month, these
categories. Carving out that part is slicing, and the tool is the
WHERE clause you already know — used with an analyst's intent. This page
sharpens your filtering vocabulary and explains why analysts filter as
early as possible.
A slice is a question in disguise
Every business question hides a filter. "How are sales in the West?" means
WHERE region = 'West'. "What happened in Q1?" means a date range.
Learning to hear the filter inside a question is half of analytical SQL.
Ranges with BETWEEN
Numeric and date ranges come up constantly — a price band, a date window,
an age bracket. BETWEEN reads naturally and is inclusive on both
ends.
BETWEEN includes both endpoints
amount BETWEEN 100 AND 200 keeps rows where amount is 100, 200, and
everything in between. If you need to exclude an endpoint, use explicit
>= / < comparisons instead.
Membership with IN
When a column must match one of several values, IN is cleaner than a
chain of ORs.
category IN ('Office', 'Electronics') is exactly category = 'Office' OR category = 'Electronics', but it scales gracefully to a long list.
Text patterns with LIKE and ILIKE
Analysts often filter text by shape rather than exact value: names
starting with "A", emails at a domain, codes containing a substring.
LIKE uses % (any run of characters) and _ (one character). DuckDB's
ILIKE is the case-insensitive version.
Combining conditions — and the danger of AND/OR
Real slices combine conditions. Be careful mixing AND and OR: AND
binds tighter than OR, so without parentheses the query may not mean
what you intended. Make your grouping explicit.
Without the parentheses, amount > 100 AND region = 'West' OR region = 'East' would also return every East order regardless of amount — a
classic, silent analytical bug.
Why analysts filter early
Filtering early is both a performance habit and a clarity habit:
- Less data to process. A slice of 10,000 rows summarizes faster than
a table of 10 million. On a column store, a tight
WHERElets the engine skip whole chunks. - Clearer reasoning. Once the data is reduced to "the rows this question is about," every later step — grouping, joining, averaging — is easier to think about.
The rule of thumb: slice down to what the question is about as soon as you can.
Check your understanding
Which rows does amount BETWEEN 100 AND 200 keep?
Rows where amount is strictly between 100 and 200, excluding both.
Rows where amount is 100, 200, or anything in between (both endpoints included).
Only rows where amount equals exactly 100 or exactly 200.
No rows, because BETWEEN needs three values.
Why is region IN ('West', 'East') preferred over region = 'West' OR region = 'East' for longer lists?
Because IN checks ranges while OR checks equality.
They are equivalent, but IN is more readable and scales cleanly as the list of values grows.
IN is faster because it ignores NULLs differently and returns more rows.
OR cannot be used with text columns.
A query has WHERE amount > 100 AND region = 'West' OR region = 'East' with no parentheses. Why is this risky?
It is a syntax error and will not run.
AND binds tighter than OR, so it returns big West orders plus every East order, which is probably not intended.
It always returns zero rows.
OR is not allowed in a WHERE clause.
You can now carve any slice you need. Often the next move is to sort a slice and look at its top or bottom — the subject of the next page.
Profiling a Dataset
The analyst's first ten minutes with new data — counting rows, checking ranges, measuring distinctness, and finding missing values. DuckDB's SUMMARIZE makes it nearly automatic.
Sorting and Top-N
ORDER BY as an analytical tool — surfacing extremes, ranking results, and answering "top 10" questions, plus DuckDB's QUALIFY for per-group top-N.