First Look at a Dataset
The five-minute ritual every analyst performs on a new dataset — and the questions to ask before you compute a single number.
You have a fresh DataFrame. The temptation is to dive straight
into .mean() and charts. Resist it. Spend five minutes
getting to know the dataset first. Those five minutes save hours
later.
The five-minute ritual
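Every analyst eventually develops a personal version of this ritual. Here is a solid one to start with:
- Step 1: shape and head
- Step 2: dtypes and memory
- Step 3: describe
- Step 4: missing-value map
- Step 5: unique-value counts for categoricals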
Let us walk each step on a real dataset.
Step 1 — Shape and head
The first question: what does the table look like, and how big is it?
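A minimal sketch of this first check. The small DataFrame is a made-up stand-in (its values and the EmployeeNumber column are assumptions for illustration); with the real data, df would come from pd.read_csv instead.

```python
import pandas as pd

# Toy stand-in for the HR data (made-up values, not the real file).
df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3, 4],
    "Age": [41, 49, 37, 33],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})

n_rows, n_cols = df.shape          # shape is a (rows, columns) tuple
print(f"{n_rows} rows x {n_cols} columns")

print(df.columns.tolist())         # confirm the columns you were promised exist
print(df.head())                   # first five rows by default
```

On a large table, df.sample(5) is a handy complement to head(), since the first rows of a file are not always representative.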
What you are looking for:
- Reasonable row count. Does 1,470 employees match what someone told you to expect? If you expected a million and got a thousand, something was filtered.
- All the columns you expected. If you were promised manager_id and there is no such column, ask.
- Sample values that make sense. If Age shows None, "N/A", or a wildly out-of-range value, you know cleaning is in your future.
Step 2 — Dtypes and memory
The second question: did Pandas interpret the columns correctly, and is the dataset going to fit in memory?
df.info(memory_usage="deep") prints each column's dtype and
non-null count, plus the DataFrame's total memory usage (the
"deep" setting measures each string instead of guessing).
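A short sketch of the dtype and memory check. The Salary and ZipCode columns are hypothetical, planted to show two common misreads; they are not taken from the HR file.

```python
import pandas as pd

# Toy data with two planted dtype problems (made-up values).
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Salary": ["5,000", "6,200", "4,100"],   # numbers that arrived as text
    "ZipCode": [61820, 61801, 60601],        # identifiers that arrived as integers
})

print(df.dtypes)                   # one dtype per column
df.info(memory_usage="deep")       # dtypes, non-null counts, measured memory

# The same memory figure as a plain number of bytes (index included).
deep_bytes = int(df.memory_usage(deep=True).sum())
print(deep_bytes)
```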
Look for:
- Strings that should be numbers (e.g. "5,000" with a comma comes in as object).
- Numbers that should be strings (ZIP codes, phone numbers).
- Object columns that should be categorical (a small, repeating vocabulary like Department).
- object columns that contain dates: they need pd.to_datetime.
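Each item in the list above has a typical fix. The sketch below shows one per item on made-up data; the column names other than Department are assumptions for illustration.

```python
import pandas as pd

# Toy data with one problem per column (made-up values).
raw = pd.DataFrame({
    "Salary": ["5,000", "6,200", "4,100"],
    "ZipCode": [61820, 61801, 60601],
    "Department": ["Sales", "HR", "Sales"],
    "HireDate": ["2021-03-01", "2019-11-15", "2020-07-30"],
})

clean = raw.assign(
    # Strip the thousands separator, then parse as a number.
    Salary=pd.to_numeric(raw["Salary"].str.replace(",", "", regex=False)),
    # Identifiers are labels, not quantities, so store them as text.
    ZipCode=raw["ZipCode"].astype(str),
    # A small repeating vocabulary fits the category dtype.
    Department=raw["Department"].astype("category"),
    # Date strings become real datetimes.
    HireDate=pd.to_datetime(raw["HireDate"]),
)

print(clean.dtypes)
```

Casting ZIP codes to text after loading cannot restore leading zeros that were already dropped. When that matters, passing dtype={"ZipCode": str} to pd.read_csv at load time is the safer route.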
Step 3 — Describe
describe gives you summary statistics for each column. By
default it summarizes numeric columns; pass include="all" to
also describe text/categorical columns.
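A minimal sketch of both calls on made-up data, with one impossible age planted so there is something to catch.

```python
import pandas as pd

# Toy data (made-up values); the -3 is a planted bad value.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33, -3],
    "Department": ["Sales", "HR", "Sales", "Sales", "HR"],
})

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # adds unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```

In the combined table, statistics that do not apply to a column (for example, mean for Department) show up as NaN.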
For numeric columns, look at:
- count: should match len(df) unless there are NaNs.
- min / max: sanity check the range. Age of -3? Salary of $999,999,999? Bug.
- mean vs 50% (median): large differences signal skew or outliers.
- std: a very small std means the column barely varies and may not be useful.
For object columns:
- unique: number of distinct values. Useful for spotting near-categoricals.
- top: most common value.
- freq: how many times the top value appears.
Step 4 — Missing-value map
The fourth question: where is data missing?
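One way to build a quick missing-value map, sketched on made-up data. The ManagerId column is hypothetical and left mostly empty on purpose.

```python
import numpy as np
import pandas as pd

# Toy data with planted gaps (made-up values).
df = pd.DataFrame({
    "Age": [41, np.nan, 37, 33],
    "Department": ["Sales", "HR", None, "Sales"],
    "ManagerId": [np.nan, np.nan, np.nan, 7.0],
})

missing_count = df.isna().sum()           # missing cells per column
missing_pct = df.isna().mean() * 100      # same, as a percentage of rows
missing_per_row = df.isna().sum(axis=1)   # missing cells per row

summary = pd.DataFrame({"count": missing_count, "pct": missing_pct})
print(summary.sort_values("count", ascending=False))
print(missing_per_row)
```

The per-column view points at columns with heavy missingness, while the per-row view points at rows that are mostly empty.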
The HR dataset is unusually clean — you may see zero missing. Most real datasets have several columns with some missingness. The pattern matters:
- Random missingness can often be ignored or imputed.
- Structural missingness (a column that is only populated for some rows by design) tells you something about the dataset's grain.
- Concentrated missingness (one row is mostly NaN) may indicate a bad row that should be dropped.
We have a whole chapter on this.
Step 5 — Unique-value counts for categoricals
For every text-like column, look at the values:
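A sketch of that loop on made-up data. The EmployeeName column and the messy Department spellings are planted for illustration, and treating every non-numeric column as text-like is an assumption that may need adjusting for dates or booleans.

```python
import pandas as pd

# Toy data with planted problems (made-up values, not the real HR file).
df = pd.DataFrame({
    "EmployeeName": ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay"],
    "Department": ["Sales", "sales", "SALES ", "Sales", "Engineering", "Enginering"],
    "Age": [41, 49, 37, 33, 29, 52],
})

# Treat every non-numeric column as text-like.
text_cols = df.select_dtypes(exclude="number").columns.tolist()

for col in text_cols:
    print(f"--- {col}: {df[col].nunique()} distinct values in {len(df)} rows ---")
    print(df[col].value_counts(dropna=False))

# Casing and stray whitespace inflate the distinct count; typos survive this step.
dept_raw = df["Department"].nunique()
dept_normalized = df["Department"].str.strip().str.lower().nunique()
print(dept_raw, "->", dept_normalized)

# A column with a different value on every row behaves like an identifier.
identifier_like = [c for c in text_cols if df[c].nunique() == len(df)]
print(identifier_like)
```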
What to look for:
- Casing inconsistencies: "Sales", "sales", "SALES".
- Typos: "Engineering", "Enginering".
- Near-duplicates: "USA", "U.S.", "United States".
- Surprising rare values: a single row with department "???" is probably a data-entry mistake.
- Cardinality: if every row has a unique value (like a name), this is an identifier column, not a category.
A second-look ritual: pairwise interactions
After the five-minute ritual, do a few quick cross-tabulations.
These take a minute and often reveal the structure of the dataset (e.g., "Each JobRole belongs to exactly one department — so they're a hierarchy"). That kind of insight changes how you think about all subsequent analysis.
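A quick cross-tabulation might look like the sketch below. The rows are made up and deliberately built so that each role sits in one department; whether that holds in a real dataset is exactly what the check is for.

```python
import pandas as pd

# Toy data (made-up values), built so each JobRole maps to one Department.
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "R&D", "R&D", "HR"],
    "JobRole": ["Sales Executive", "Sales Representative",
                "Research Scientist", "Laboratory Technician", "Human Resources"],
    "Attrition": ["Yes", "No", "No", "Yes", "No"],
})

# Counts of each role within each department.
role_by_dept = pd.crosstab(df["JobRole"], df["Department"])
print(role_by_dept)

# A role is nested in Department if it appears in exactly one department.
depts_per_role = df.groupby("JobRole")["Department"].nunique()
is_nested = bool((depts_per_role == 1).all())
print("JobRole nested within Department:", is_nested)

# Share of Yes/No attrition within each department (rows sum to 1).
attrition_by_dept = pd.crosstab(df["Department"], df["Attrition"], normalize="index")
print(attrition_by_dept)
```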
Document what you find
A common professional habit: keep a short Markdown note (in the notebook itself or in a separate file) capturing your findings from the five-minute ritual.
Dataset: HR-dataset-v14.csv. 1,470 rows × 35 cols. Each row is one employee. No missing values.
- Department has 3 values (R&D, Sales, HR).
- JobRole (9 values) is nested within Department.
- Attrition is "Yes"/"No", ~16% attrition overall.
The next person who picks up your notebook will thank you. The next person is often you in three months.
The single most under-rated skill
The analysts who get promoted are usually the ones who can quickly summarize "what is in this dataset?" in plain English. The technical skills come naturally with practice; the habit of looking before computing is much rarer.
A guided exploration challenge
Load this HR dataset:
https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv
Then compute four values and assign them to specific variables:
- n_rows: total number of employees
- n_cols: total number of columns
- n_missing_total: total count of NaN cells across the whole DataFrame (sum over all columns)
- departments: a sorted list of unique values in the Department column
Use Pandas, not hard-coded numbers.
Check your understanding
Why does the chapter recommend running df.describe() before any "real" analysis?
To make the dataset look bigger
Because it gives a quick read on ranges, central tendency, and spread — and surfaces obviously wrong values (negative ages, impossible maxes) early
It is required by Pandas
It removes missing values
In the missing-value step, why is the pattern of missingness as important as the count?
It is not
Because random missingness can often be ignored or imputed, structural missingness reveals something about the dataset's grain, and concentrated missingness flags bad rows — each pattern suggests a different fix
It changes how Pandas computes means
Pandas requires you to look at the pattern
A column with 1,470 rows shows 1,470 unique values when you call .value_counts(). What is it most likely to be?
A categorical variable
A boolean column
An identifier column (like an employee ID or name) — every row has a different value, so it is unique per row
A date column
Which of these is the chapter's recommended habit after completing the five-minute ritual?
Delete the dataset
Switch to Excel
Write a short Markdown note summarizing what you learned (rows, grain, missingness, surprises) so you and others can find it later
Run all the visualizations