The EDA Workflow
A repeatable, opinionated approach to getting to know a new dataset — and why every analyst needs one.
Exploratory Data Analysis (EDA) is the part of analysis where you build intuition about a dataset before answering specific questions. Without intuition you'll ask the wrong questions, miss obvious problems, and trust the wrong numbers.
EDA is less about specific functions and more about a habit. This page outlines a workflow you can apply to any new dataset.
Why a workflow?
Without this scaffolding, beginners jump straight to groupby
and chart-making, only to realize halfway through that one
column was actually full of garbage.
Step 1 — Shape and types
Questions to answer:
- How many rows and columns?
- What are the column types? Any string columns that should be numeric or date?
- Are there obviously-wrong types (e.g., object where you expected float64)?
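A minimal sketch of how you might answer these questions with pandas. The DataFrame, its column names, and its values are made up for illustration; with real data you would load your own file first.

```python
import pandas as pd

# Hypothetical toy data: "price" arrived as text because of one bad entry.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "price": ["9.99", "12.50", "n/a", "7.25"],
    "ordered_at": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-09"],
})

n_rows, n_cols = df.shape   # (number of rows, number of columns)
print(n_rows, n_cols)
print(df.dtypes)            # one dtype per column; text columns show as object or a string dtype

# Try converting suspicious text columns; entries that cannot be parsed become NaN / NaT.
price = pd.to_numeric(df["price"], errors="coerce")
ordered_at = pd.to_datetime(df["ordered_at"], errors="coerce")
print(price.dtype, ordered_at.dtype)
print(price.isna().sum(), "price value(s) could not be parsed")
```

Using errors="coerce" is a choice for exploration only: it shows how many values fail to parse without raising an exception, but it does not tell you why they failed.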
Step 2 — Eyeball the data
head shows the top — but the top is often not representative.
A random sample sometimes catches surprises: a different format
midway through, mysterious sentinel values, NaNs you didn't
know about.
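As a rough illustration, the snippet below looks at the top, the bottom, and a random sample of a small made-up DataFrame in which the date format changes partway through and a sentinel value is hiding in the middle. The data and column names are assumptions for the example.

```python
import pandas as pd

# Hypothetical data: the first rows look clean, the trouble starts later.
df = pd.DataFrame({
    "reading": [1.0, 1.2, 1.1, 1.3, -999.0, 1.2, None, 1.4, 1.1, 1.0],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05",
             "06/01/2024", "07/01/2024", "08/01/2024", "09/01/2024", "10/01/2024"],
})

top = df.head(3)                             # first rows only
bottom = df.tail(3)                          # last rows
random_rows = df.sample(5, random_state=0)   # random rows; the seed makes the draw repeatable

print(top)
print(bottom)
print(random_rows)
```

A single sample is not guaranteed to include the odd rows, so it can help to draw a few samples with different seeds.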
Step 3 — Missing values
For each column with missing data, you'll need to decide later what to do — but you must know it exists now.
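One way to get a per-column picture of missing data is to combine isna() with sum() and mean(). This is a sketch on invented data, not the only way to do it.

```python
import pandas as pd

# Hypothetical data with gaps in two of the three columns.
df = pd.DataFrame({
    "email": ["a@x.com", None, "c@x.com", None],
    "age": [34, 29, None, 41],
    "country": ["USA", "USA", "FR", "DE"],
})

missing_count = df.isna().sum()    # number of missing values per column
missing_share = df.isna().mean()   # fraction of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
# Show only the columns that actually have gaps, worst first.
print(summary[summary["missing"] > 0].sort_values("share", ascending=False))
```

Note that isna() only counts real missing values (NaN, None, NaT). Placeholders such as "n/a" or -999 will not show up here.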
Step 4 — Per-column distributions
For numeric columns, look at describe():
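For instance, on a small invented salary column (the values are assumptions chosen to include a zero and one very large entry):

```python
import pandas as pd

# Hypothetical salaries, including a suspicious 0 and one extreme value.
df = pd.DataFrame({"salary": [42000, 45000, 47000, 51000, 0, 250000]})

stats = df["salary"].describe()   # count, mean, std, min, quartiles, max
print(stats)

# A mean far from the median (the "50%" row) hints at skew or outliers.
print("mean:", stats["mean"], "median:", stats["50%"])
```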
For categorical columns, look at value_counts():
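A short sketch on an invented country column; passing dropna=False is a choice made here so that missing values are counted alongside the real categories.

```python
import pandas as pd

# Hypothetical category column with a missing value and a likely placeholder.
df = pd.DataFrame({
    "country": ["USA", "United States", "FR", "USA", None,
                "Unknown", "Unknown", "Unknown"],
})

counts = df["country"].value_counts(dropna=False)                  # rows per category
shares = df["country"].value_counts(normalize=True, dropna=False)  # fraction per category

print(counts)
print(shares)
```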
Questions:
- Any zero values where there shouldn't be?
- Any extreme min/max suggesting outliers or sentinels?
- Any category with suspiciously many entries (possibly a default value)?
- Any category appearing twice with different spelling?
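Some of these questions can be turned into quick checks. The sketch below counts zeros and negative values in a numeric column, and flags categories that differ only by case or surrounding whitespace. The data is invented, and the normalization is deliberately simple: it would not catch pairs like "USA" and "United States", which still need a human eye.

```python
import pandas as pd

# Hypothetical data with a zero, a sentinel, and inconsistently typed city names.
df = pd.DataFrame({
    "price": [9.99, 0.0, 12.5, -999.0, 7.25],
    "city": ["Paris", "paris ", "Berlin", "PARIS", "Berlin"],
})

zero_count = (df["price"] == 0).sum()      # zeros where a price is expected
negative_count = (df["price"] < 0).sum()   # negative prices are likely sentinels

# Group the raw labels by a normalized form and count distinct spellings per group.
normalized = df["city"].str.strip().str.lower()
variants = df["city"].groupby(normalized).nunique()
suspects = variants[variants > 1]          # normalized labels with more than one spelling

print(zero_count, negative_count)
print(suspects)
```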
Step 5 — Relationships between columns
Correlations highlight pairs of columns that move together — sometimes useful, sometimes a hint that columns are measuring the same thing.
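A minimal sketch using DataFrame.corr() on invented data, where two columns record the same height in different units. The column names and values are assumptions for the example.

```python
import pandas as pd

# Hypothetical data: height_cm and height_in measure the same thing.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [31, 45, 23, 52, 38],
    "name": list("abcde"),
})

# numeric_only=True skips text columns such as "name".
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

By default corr() computes Pearson correlation, which measures linear association only. A value near zero does not rule out a non-linear relationship.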
For categorical-vs-numeric relationships, group:
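For example, on an invented region and revenue table, you might compare count, mean, and median per group in one pass:

```python
import pandas as pd

# Hypothetical data: one large South order pulls that region's mean upward.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "revenue": [100.0, 140.0, 80.0, 90.0, 400.0],
})

# One row per region, one column per statistic.
by_region = df.groupby("region")["revenue"].agg(["count", "mean", "median"])
print(by_region)
```

Including count alongside the averages is a deliberate choice: a group mean based on two or three rows deserves less trust than one based on thousands.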
Step 6 — Write down what you learned
This is the step beginners skip and pros never skip. Keep a notebook section called "What I learned from EDA" with bullet points:
- "5% of email is missing; mostly in early 2020 rows."
- "country has 'USA' and 'United States': same value, different label."
- "Salary is heavily right-skewed; use median, not mean."
- "Three columns are nearly perfectly correlated."
These notes shape every subsequent decision.
EDA never ends
You'll learn new things about the dataset every time you touch it. Your notes should grow over the life of the project.
An EDA checklist
Print it. Tape it to your wall. Use it on every new dataset.
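The mechanical parts of Steps 1 to 5 can also be bundled into a small helper that you run on every new dataset. The function below is a sketch under assumptions: eda_summary is a hypothetical name, not a pandas function, and the toy data exists only to show the output. It does not replace eyeballing the rows or writing down what you learned.

```python
import pandas as pd


def eda_summary(df: pd.DataFrame) -> dict:
    """Hypothetical helper: collect the mechanical checks from Steps 1 to 5."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,                              # Step 1: rows and columns
        "dtypes": df.dtypes.astype(str).to_dict(),      # Step 1: column types
        "missing": df.isna().sum().to_dict(),           # Step 3: missing values
        "numeric_summary": numeric.describe(),          # Step 4: distributions
        "correlations": numeric.corr(),                 # Step 5: relationships
    }


# Invented example data.
df = pd.DataFrame({
    "region": ["North", "South", "South", "North"],
    "revenue": [100.0, None, 90.0, 140.0],
    "orders": [10, 8, 9, 14],
})

report = eda_summary(df)
for name, value in report.items():
    print(f"--- {name} ---")
    print(value)
```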
Check your understanding
A friend gives you a CSV they want analysed and asks "What's the average revenue per region?" What should you do first?
Immediately compute and reply
Refuse — too vague
Run an EDA pass — check shape, types, missing values, distinct regions — then compute the answer with confidence
Switch to SQL
Why look at df.sample(5) instead of just df.head()?
It is faster
It uses less memory
head shows only the top rows; random sampling reveals issues that may occur only in the middle or at the end of the file
It returns sorted rows
Two columns have a correlation of 0.98. What's the most likely interpretation?
They cause each other
They are independent
They are measuring something very similar — worth investigating whether one is derived from the other, or whether keeping both adds any information
It is a bug
What's the value of writing down what you learned during EDA?
It is required by Pandas
It is for your manager
The next person to read your notebook (often future-you, six weeks later) will not remember the dataset's quirks — written notes preserve hard-won knowledge
It speeds up the kernel