Profiling a Dataset
The analyst's first ten minutes with new data — counting rows, checking ranges, measuring distinctness, and finding missing values. DuckDB's SUMMARIZE makes it nearly automatic.
When an analyst opens an unfamiliar dataset, they do not start with the interesting business question. They start by profiling it: a few quick queries that reveal the data's size, range, distinctness, and gaps. Skipping this step is how analysts get fooled — by duplicates, missing values, or a date column that secretly stops in March. This page is the profiling checklist, expressed in SQL.
Why profile first
A dataset can lie to you in quiet ways:
- It is smaller or larger than you assumed.
- A column has far fewer distinct values than expected (or far more).
- Some values are missing (
NULL), silently dropping rows from averages. - The range is wrong — dates that stop early, amounts that go negative.
Profiling surfaces these before they corrupt a conclusion. Think of it as checking the ingredients before you cook.
Size and range
The first two questions are always how many rows and what does the key
column span. COUNT(*), MIN, and MAX answer both at once.
In seconds you learn the table's size and the window of time it covers — without scrolling a single row.
Distinctness: COUNT(DISTINCT ...)
How many different values does a column hold? This reveals whether a
column is a category (a few values), an identifier (all unique), or
something in between. The gap between COUNT(*) and COUNT(DISTINCT id)
also reveals duplicates.
If distinct_ids were less than rows, you would have duplicate ids — a
red flag worth investigating before trusting any per-id summary.
Missing values: counting NULLs
COUNT(column) ignores NULLs, while COUNT(*) counts every row. The
difference is exactly the number of missing values — a one-line null
check.
Knowing how many values are missing tells you whether a column is trustworthy or needs cleaning before analysis.
The shortcut: SUMMARIZE
DuckDB bundles the whole profiling checklist into one keyword:
SUMMARIZE. Prefix any table or query with it and DuckDB returns, per
column, the count, number of distinct values, null percentage, min, max,
average, and approximate quantiles.
One line, a complete profile. SUMMARIZE is the first thing many DuckDB
analysts type when they meet a new table — it answers size, range,
distinctness, and nulls for every column simultaneously.
The profiling checklist
Before analyzing any new dataset, answer:
- How many rows? (
COUNT(*)) - What range does the key/date span? (
MIN,MAX) - How distinct is each important column? (
COUNT(DISTINCT ...)) - Are there duplicates? (
COUNT(*)vsCOUNT(DISTINCT id)) - What is missing? (
COUNT(*) - COUNT(column))
Or just run SUMMARIZE and read the lot at a glance.
Check your understanding
A table has COUNT(*) = 1000 but COUNT(DISTINCT id) = 980. What does this most likely indicate?
The id column has 20 NULL values.
There are duplicate id values — 1000 rows but only 980 unique ids.
The query is invalid.
The table is sorted incorrectly.
How do you count the missing values in a column email?
COUNT(email) gives the number of missing values directly.
COUNT(*) - COUNT(email), because COUNT(*) counts all rows and COUNT(email) skips NULLs.
SUM(email), which adds up the NULLs.
It is impossible to count NULLs in SQL.
What does DuckDB's SUMMARIZE keyword do?
It deletes summary rows from a table.
It returns a per-column profile — count, distinct values, null percentage, min, max, average, and quantiles — for a table or query.
It permanently sorts the table by every column.
It is an alias for GROUP BY.
With a trustworthy mental model in hand, you can start carving the data into the slices you care about — which is exactly the next page.
Loading and Generating Data
Where analytical data comes from — tables, literal VALUES, files, and DuckDB's range() generator — plus CREATE TABLE AS, the analyst's favorite way to capture a result.
Filtering and Slicing
Using WHERE, BETWEEN, IN, and pattern matching to carve a dataset into the exact slice a question needs — and why filtering early is the analyst's habit.