Data Analysis with Python Pandas

The EDA Workflow

A repeatable, opinionated approach to getting to know a new dataset — and why every analyst needs one.

Exploratory Data Analysis (EDA) is the part of analysis where you build intuition about a dataset before answering specific questions. Without intuition you'll ask the wrong questions, miss obvious problems, and trust the wrong numbers.

EDA is less about specific functions and more about a habit. This page outlines a workflow you can apply to any new dataset.

Why a workflow?

Without a workflow as scaffolding, beginners jump straight to groupby and chart-making, only to realize halfway through that a column they relied on was full of garbage.

Step 1 — Shape and types
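
A minimal sketch of this step, assuming a small hypothetical DataFrame named df (the column names and values are made up for illustration, standing in for whatever you have just loaded):

```python
import pandas as pd

# Hypothetical stand-in for a freshly loaded dataset.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["North", "South", "North", "West"],
    "revenue": ["120.50", "80", "n/a", "42.10"],  # numbers stored as text
    "order_date": ["2024-01-05", "2024-01-06", "2024-02-01", "2024-02-03"],
})

n_rows, n_cols = df.shape   # (rows, columns)
print(f"{n_rows} rows x {n_cols} columns")

print(df.dtypes)            # one dtype per column
df.info()                   # dtypes plus non-null counts and memory usage
```

In this made-up data, revenue and order_date both come back as text rather than as a number and a date, which is exactly the kind of mismatch this step is meant to surface.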

Questions to answer:

How many rows and columns?
What are the column types? Any string columns that should be numeric or date?
Are there obviously-wrong types (e.g., object where you expected float64)?
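
One way to test whether a text column "should" be numeric or a date is to try the conversion and look at what fails. The sketch below assumes two hypothetical columns and an ISO date format; both are illustrative choices, not part of the original page.

```python
import pandas as pd

# Hypothetical columns that were read in as text.
revenue = pd.Series(["120.50", "80", "n/a", "42.10"], name="revenue")
order_date = pd.Series(
    ["2024-01-05", "2024-01-06", "not recorded", "2024-02-03"], name="order_date"
)

# errors="coerce" turns anything unparseable into NaN / NaT instead of raising.
as_number = pd.to_numeric(revenue, errors="coerce")
as_date = pd.to_datetime(order_date, format="%Y-%m-%d", errors="coerce")

# Values that were present in the original but failed to convert.
bad_numbers = revenue[as_number.isna() & revenue.notna()]
bad_dates = order_date[as_date.isna() & order_date.notna()]

print(bad_numbers.tolist())
print(bad_dates.tolist())
```

If only a handful of values fail, the column is probably numeric (or a date) with a few placeholder strings mixed in.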

Step 2 — Eyeball the data
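
A minimal sketch of this step, assuming a hypothetical DataFrame whose first rows look clean while later rows hide a sentinel and a missing value. The tail call is an extra not mentioned in the text, added because it is cheap to run alongside head.

```python
import pandas as pd

# Hypothetical data: the top looks fine, the problems sit further down.
df = pd.DataFrame({
    "id": range(1, 11),
    "amount": [10.0, 12.5, 11.0, 9.5, 10.5, -999.0, 11.5, None, 10.0, 12.0],
})

print(df.head())                     # first 5 rows
print(df.tail())                     # last 5 rows
print(df.sample(5, random_state=0))  # 5 random rows; random_state makes the draw repeatable
```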

head shows the top — but the top is often not representative. A random sample sometimes catches surprises: a different format midway through, mysterious sentinel values, NaNs you didn't know about.

Step 3 — Missing values
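
A minimal sketch of this step, assuming a hypothetical DataFrame with made-up name, email, and age columns. It reports both the count and the share of missing values per column.

```python
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "email": ["ana@example.com", None, None, "dee@example.com"],
    "age": [34, 29, None, 41],
})

missing_count = df.isna().sum()   # missing values per column
missing_share = df.isna().mean()  # fraction missing per column (0.0 to 1.0)

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```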

You can decide later how to handle each column with missing data, but you need to know now which columns are affected and how badly.

Step 4 — Per-column distributions

For numeric columns, look at describe():
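
A minimal sketch, assuming a hypothetical DataFrame with two made-up numeric columns, one containing a suspicious zero and the other an extreme value:

```python
import pandas as pd

# Hypothetical numeric columns.
df = pd.DataFrame({
    "age": [34, 29, 0, 41, 38],                       # 0 is suspicious for an age
    "salary": [52000, 48000, 51000, 50000, 950000],   # one extreme value
})

stats = df.describe()  # count, mean, std, min, quartiles, max for each numeric column
print(stats)
```

The min row exposes the zero age, and the gap between mean and the 50% row (the median) for salary hints at the outlier.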

For categorical columns, look at value_counts():
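
A minimal sketch, assuming a hypothetical country column with inconsistent labels and one missing value:

```python
import pandas as pd

# Hypothetical categorical column.
country = pd.Series(
    ["USA", "United States", "usa", "Canada", "USA", None], name="country"
)

counts = country.value_counts(dropna=False)    # include missing values in the tally
shares = country.value_counts(normalize=True)  # proportions among non-missing values

print(counts)
print(shares)
```

Passing dropna=False is a deliberate choice here: by default value_counts leaves missing values out, which can hide them at exactly the moment you are looking for problems.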

Questions:

Any zero values where there shouldn't be?
Any extreme min/max suggesting outliers or sentinels?
Any category with suspiciously many entries (a default value, perhaps)?
Any category appearing twice with different spelling?
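
Some of these questions can be turned into quick checks. The sketch below assumes hypothetical country and age columns; it counts values that should not be zero or negative, and compares the number of distinct labels before and after normalizing case and whitespace.

```python
import pandas as pd

# Hypothetical data with spelling variants and implausible ages.
df = pd.DataFrame({
    "country": ["USA", "usa", " USA", "Canada", "canada"],
    "age": [34, 0, 29, -1, 41],
})

# Zero or negative values where they should not occur.
zero_or_negative = int((df["age"] <= 0).sum())

# Distinct labels before and after trimming whitespace and lowercasing.
raw_labels = df["country"].nunique()
clean_labels = df["country"].str.strip().str.lower().nunique()

print(zero_or_negative)
print(raw_labels, clean_labels)  # a drop means some labels differ only by case or spacing
```

Normalizing only catches variants that differ by case or spacing. Labels such as 'USA' and 'United States' still need a human reading the value_counts output.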

Step 5 — Relationships between columns
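
A minimal sketch of this step, assuming a hypothetical DataFrame in which two columns record the same measurement in different units. numeric_only=True restricts the calculation to numeric columns.

```python
import pandas as pd

# Hypothetical data: height_cm and height_in are the same quantity in two units.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "score": [3, 9, 4, 8, 5],
    "city": ["A", "B", "A", "C", "B"],
})

corr = df.corr(numeric_only=True)  # pairwise Pearson correlation by default
print(corr.round(2))
```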

Correlations highlight pairs of columns that move together — sometimes useful, sometimes a hint that columns are measuring the same thing.

For categorical-vs-numeric relationships, group:
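
A minimal sketch, assuming hypothetical region and revenue columns:

```python
import pandas as pd

# Hypothetical data: revenue per order, tagged with a region.
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "West"],
    "revenue": [100.0, 140.0, 80.0, 90.0, 300.0],
})

# Several summaries per group; count shows how much data backs each number.
by_region = df.groupby("region")["revenue"].agg(["count", "mean", "median"])
print(by_region)
```

Including count alongside mean and median is a design choice: a group average built on a single row (West, in this made-up data) deserves less trust than one built on many.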

Step 6 — Write down what you learned

Beginners skip this step; experienced analysts never do. Keep a notebook section called "What I learned from EDA" with bullet points:

"5% of email is missing; mostly in early 2020 rows."
"country has 'USA' and 'United States' — same value, different label."
"Salary is heavily right-skewed — use median, not mean."
"Three columns are nearly perfectly correlated."

These notes shape every subsequent decision.
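
A note like the one about right-skewed salaries can be backed by a quick comparison. The sketch below assumes a hypothetical salary column with one very large value and shows how far the mean drifts from the median.

```python
import pandas as pd

# Hypothetical salaries with a single very high earner.
salary = pd.Series([40000, 42000, 45000, 47000, 50000, 400000], name="salary")

mean_salary = salary.mean()
median_salary = salary.median()
skewness = salary.skew()  # positive values indicate a long right tail

print(mean_salary, median_salary, round(skewness, 2))
```

Here the mean lands well above what most people in the sample earn, while the median stays close to the typical value.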

EDA never ends

You'll learn new things about the dataset every time you touch it. Your notes should grow over the life of the project.

An EDA checklist

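Print it. Tape it to your wall. Use it on every new dataset.

Shape and types: how many rows and columns, and does every column have the type you expected?
Eyeball the data: look at head and at a random sample.
Missing values: which columns have them, and how many?
Per-column distributions: describe() for numeric columns, value_counts() for categorical ones.
Relationships between columns: correlations and grouped summaries.
Notes: write down what you learned.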
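
The steps on this page can also be bundled into a small helper so the first pass is always the same. This is only a sketch: the function name eda_pass, its return structure, and the demo data are hypothetical, not something Pandas provides.

```python
import pandas as pd


def eda_pass(df: pd.DataFrame, sample_rows: int = 5) -> dict:
    """Collect the checks from Steps 1 to 5 into one dictionary of results."""
    n = min(sample_rows, len(df))
    non_numeric = df.select_dtypes(exclude="number").columns
    return {
        "shape": df.shape,                          # Step 1
        "dtypes": df.dtypes,                        # Step 1
        "sample": df.sample(n, random_state=0),     # Step 2
        "missing": df.isna().sum(),                 # Step 3
        "numeric_summary": df.describe(),           # Step 4
        "category_counts": {                        # Step 4
            col: df[col].value_counts(dropna=False) for col in non_numeric
        },
        "correlations": df.corr(numeric_only=True),  # Step 5
    }


# Hypothetical demo data.
df = pd.DataFrame({
    "region": ["North", "South", "North", None],
    "revenue": [100.0, 80.0, None, 120.0],
    "units": [10, 8, 9, 12],
})

report = eda_pass(df)
for name, result in report.items():
    print(f"--- {name} ---")
    print(result)
```

Step 6 stays manual on purpose: the helper can print numbers, but deciding what they mean and writing it down is still your job.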

Check your understanding

Question (select one)

A friend gives you a CSV they want analysed and asks "What's the average revenue per region?" What should you do first?

Immediately compute and reply

Refuse — too vague

Run an EDA pass — check shape, types, missing values, distinct regions — then compute the answer with confidence

Switch to SQL

Question (select one)

Why look at df.sample(5) instead of just df.head()?

It is faster

It uses less memory

head shows only the top rows; random sampling reveals issues that may occur only in the middle or at the end of the file

It returns sorted rows

Question (select one)

Two columns have a correlation of 0.98. What's the most likely interpretation?

They cause each other

They are independent

They are measuring something very similar — worth investigating whether one is derived from the other, or whether keeping both adds any information

It is a bug

Question (select one)

What's the value of writing down what you learned during EDA?

It is required by Pandas

It is for your manager

The next person to read your notebook (often future-you, six weeks later) will not remember the dataset's quirks — written notes preserve hard-won knowledge

It speeds up the kernel
