The Exploratory Data Analysis Workflow

A repeatable loop for exploring a new dataset with plots — from first glance to insight.

A fresh dataset is a stranger. Before you can trust it, model it, or present it, you have to get to know it — and the fastest way to do that is to look. Exploratory data analysis (EDA) is the disciplined habit of poking at data with summaries and plots to learn its structure, spot its quirks, and let it suggest the questions worth asking.

The statistician John Tukey championed this idea in the 1970s, arguing that data analysis should start with open-ended exploration rather than jumping straight to formal tests. His line is the spirit of this whole page: the greatest value of a picture is when it forces you to notice what you never expected to see.

EDA is not a single command or a checklist you run once. It is a loop. You plot, you notice something, you sharpen your question, and you plot again. This page gives you a practical version of that loop and maps each step to the Seaborn tools you will learn in this course.

A practical 5-step loop

Here is a sequence that works for almost any new dataset. Treat it as a starting groove, not a rigid script — real exploration jumps around, and that is fine.

  1. Understand the structure. How big is it? What are the columns and their types? Where is data missing? Tools: df.shape, df.head(), df.info(), df.isna().sum().
  2. Look at one variable at a time (univariate). What does each column's distribution look like — its center, spread, skew, and any second peak? Tools: displot / histplot, and the distribution chapters.
  3. Compare variables in pairs (bivariate). How do two variables relate, and does the relationship change when you split by a group? Tools: relplot (scatter/line), catplot (categorical), with hue.
  4. Look at many variables at once (multivariate). Which variables move together across the whole table? Tools: pairplot, faceting with col/row, and the correlation heatmap.
  5. Hypothesize, check, and communicate — then loop back. Form a hunch, test it with a targeted plot, and either refine it or move on. When a finding is solid, polish one chart to share it.

| Step | Question it answers | Seaborn tools |
| --- | --- | --- |
| 1. Structure | What is this table? | shape, head, info, isna().sum() |
| 2. Univariate | What does each variable look like? | displot, histplot |
| 3. Bivariate | How do two variables relate? | relplot, catplot, hue |
| 4. Multivariate | What moves together overall? | pairplot, facets, heatmap |
| 5. Hypothesize | Is my hunch real? | a targeted, polished plot |

Structure before beauty

Resist the urge to make a gorgeous chart in the first minute. The checks in step 1 (sizes, types, and missing values) and the plain single-column looks in step 2 are unglamorous, but they protect you from analyzing the wrong thing. A polished plot of a misread column is worse than no plot at all.

A running example: getting to know penguins

Let's walk the loop on the penguins dataset — 344 penguins measured across three Antarctic islands. We will go just far enough to feel the rhythm of plot, notice, ask again.

Step 1 — Understand the structure

First, the boring-but-essential questions. How many rows and columns? What do the first few rows look like? Which columns have missing values?
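The interactive block for this step is not reproduced here, so what follows is a minimal sketch of the same checks rather than the course's exact code. It assumes the table is loaded with sns.load_dataset("penguins"), which downloads the data on first use. The helper name describe_structure is hypothetical and only bundles the pandas calls named in step 1.

```python
import pandas as pd


def describe_structure(df: pd.DataFrame) -> dict:
    """Gather the step-1 facts about a table: size, dtypes, missing values.

    `describe_structure` is a hypothetical helper name, not a pandas or
    Seaborn function. It only wraps the calls listed in step 1.
    """
    n_rows, n_cols = df.shape  # df.shape is a (rows, columns) tuple
    return {
        "n_rows": n_rows,
        "n_cols": n_cols,
        # column name -> dtype name, e.g. "float64" or "object"
        "dtypes": df.dtypes.astype(str).to_dict(),
        # column name -> number of missing values in that column
        "missing": df.isna().sum().to_dict(),
    }


# Usage (assumes network access for the first download):
#   import seaborn as sns
#   penguins = sns.load_dataset("penguins")
#   print(penguins.head())                # the first five rows
#   print(describe_structure(penguins))   # size, types, and gaps in one dict
```

Returning a dict instead of printing keeps the facts around for later steps, for example to decide which columns need a closer look at their missing values.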

Already we have learned things no chart would have told us first: there are 344 rows and 7 columns, each of the four body-measurement columns is missing 2 values, and sex is missing 11 (a couple of penguins were not fully measured, and more have no recorded sex). That is a fact to carry forward: Seaborn will quietly drop rows with missing values in the plotted columns, which is usually fine but better known in advance than discovered by surprise.

df.info() in one go

df.info() packs much of step 1 into a single call: it prints each column's name, its non-null count (so you can spot missing data), and its dtype (numeric vs. object/categorical). It is often the very first thing to run on a new DataFrame. Try adding penguins.info() to the block above.

Step 2 — Look at one variable (univariate)

Now zoom into a single column. How is body mass distributed? A histogram bins the values so you can read the shape — where the bulk sits, how wide the spread is, and whether there is more than one peak.

Look at the shape and a question writes itself. The distribution is wide and lopsided, and there is a hint of a second bump toward the heavier end — as if the data is really two or three groups stacked on top of each other. Which raises the natural follow-up: grouped by what? That is precisely how EDA is supposed to feel — a plot answers one question and immediately hands you a sharper one.

Step 3 — Compare two variables, split by group (bivariate)

The histogram hinted at hidden groups. Let's test that by relating two measurements — flipper length and body mass — and coloring by species to see whether the groups explain the structure.
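One way to draw that plot, sketched here under the assumption that the columns are named flipper_length_mm, body_mass_g, and species as in Seaborn's penguins table. The wrapper name plot_mass_vs_flipper is hypothetical; the work is done by sns.relplot.

```python
import pandas as pd
import seaborn as sns


def plot_mass_vs_flipper(df: pd.DataFrame) -> sns.FacetGrid:
    """Scatter flipper length against body mass, colored by species.

    `plot_mass_vs_flipper` is a hypothetical wrapper around seaborn.relplot.
    """
    return sns.relplot(
        data=df,
        x="flipper_length_mm",
        y="body_mass_g",
        hue="species",   # one color per species, with a legend
        kind="scatter",
    )


# Usage (assumes network access for the first download):
#   penguins = sns.load_dataset("penguins")
#   plot_mass_vs_flipper(penguins)
```

Removing the hue argument gives the ungrouped cloud, which is a useful comparison: the same points, but without the one variable that explains their structure.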

There it is. Heavier penguins have longer flippers — a clear positive relationship — and the cloud separates into species: Gentoo penguins are the large, long-flippered group sitting up and to the right, which explains the second bump we saw in the mass histogram. One added variable (hue) took us from "there's some structure" to "the structure is species." Notice the arc across these three blocks: a summary raised a question, a univariate plot sharpened it, and a bivariate plot answered it. That arc is the loop.

Question (select one)

In the walkthrough, the body-mass histogram looked lopsided with a hint of a second peak, which led us to color a scatter plot by species. What does this sequence illustrate about EDA?

That a single histogram is always enough to fully understand a variable.

That EDA is iterative: one plot reveals something, which sharpens the next question and motivates the next plot.

That you should always start with a scatter plot before anything else.

That coloring by a group is only ever decorative.

Steps 4 and 5 — widen out, then hypothesize

From here the loop keeps turning. Step 4 (multivariate) would widen the view: a pairplot to scatter every pair of measurements at once, or a correlation heatmap to see which variables move together across the whole table — quick ways to find relationships you did not think to look for. We build those in their own chapters.
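As a small preview, here is a hedged sketch of the correlation heatmap. It assumes Pearson correlation over the numeric columns only; the helper name correlation_heatmap and the styling choices (annotations, the "vlag" colormap, a fixed -1 to 1 color range) are illustrative assumptions, not the course's settings.

```python
import pandas as pd
import seaborn as sns


def correlation_heatmap(df: pd.DataFrame):
    """Plot pairwise Pearson correlations of the numeric columns.

    `correlation_heatmap` is a hypothetical helper. It returns both the
    correlation table and the Axes so the numbers can be inspected too.
    """
    # Keep only numeric columns; text columns such as species have no
    # Pearson correlation.
    corr = df.select_dtypes("number").corr()
    # Fix the color range to [-1, 1] so colors are comparable across tables.
    ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="vlag")
    return corr, ax


# Usage (assumes network access for the first download):
#   penguins = sns.load_dataset("penguins")
#   corr, ax = correlation_heatmap(penguins)
#   sns.pairplot(penguins, hue="species")   # the scatter-every-pair view
```

A correlation computed over the whole table can hide or even reverse what happens within each group, so it is worth reading the heatmap alongside a pairplot colored by species.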

Step 5 is where exploration turns into a claim. You now have a hypothesis worth stating — "flipper length separates the species more cleanly than bill depth does" — and you would draw one targeted plot to check it, refine the wording, and, once it holds, polish that single chart to communicate the finding. Then you loop back: the answer suggests the next question, and the cycle continues.
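One possible targeted plot for that hypothesis, offered as a sketch rather than the course's method: reshape the two measurements into long form and draw side-by-side box plots per species. The helper name compare_by_species is hypothetical, and the column names flipper_length_mm and bill_depth_mm are assumed to match Seaborn's penguins table.

```python
import pandas as pd
import seaborn as sns


def compare_by_species(
    df: pd.DataFrame,
    measures=("flipper_length_mm", "bill_depth_mm"),
) -> sns.FacetGrid:
    """Box plots of each measurement by species, one panel per measurement.

    `compare_by_species` is a hypothetical helper built on DataFrame.melt
    and seaborn.catplot.
    """
    # Long form: one row per (penguin, measurement) pair.
    long = df.melt(
        id_vars="species",
        value_vars=list(measures),
        var_name="measurement",
        value_name="value",
    )
    # sharey=False because the two measurements are on different scales.
    return sns.catplot(
        data=long,
        x="species",
        y="value",
        col="measurement",
        kind="box",
        sharey=False,
    )


# Usage (assumes network access for the first download):
#   penguins = sns.load_dataset("penguins")
#   compare_by_species(penguins)
```

The less the boxes overlap within a panel, the more cleanly that measurement separates the species. That is a visual check only, so treat what you see as a lead rather than a result.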

The EDA loop

Step back and notice the shape of what we just did. We did not march in a straight line from raw data to a finished answer. We went around a loop, and each lap left us knowing more than the last.

The cycle has four moves that feed into each other, over and over:

  1. Plot. Make a quick, honest chart of what you currently wonder about — a distribution, a relationship, a comparison. Favor speed over polish here; this picture is for you.
  2. Notice. Read the chart actively. A second peak. A cluster. An outlier flung off on its own. A gap where you expected points. A trend that bends. The thing you did not expect is usually the most valuable.
  3. Ask a sharper question. Turn what you noticed into a more specific question. "There's structure" becomes "is this structure explained by species?" Each lap narrows the question.
  4. Plot again. Answer the sharper question with a new chart — often the previous one plus a hue, a col, or a switch of kind. That answer reveals the next thing to notice, and you are back at the top.

You exit the loop not when you run out of plots but when the data stops surprising you and your questions are answered well enough to act on. Only then do you slow down and craft the polished, explanatory chart that carries your finding to other people.

Make the loop cheap

The faster each lap is, the more laps you take, and the more you see. This is exactly why Seaborn's declarative style matters for EDA: one short line per chart means you can ask the next question while the thought is still warm — add a hue, swap a kind, facet by a column, and look again in seconds.
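To make that concrete, the sketch below takes three laps that differ by a single argument each. The function name three_quick_laps is hypothetical and exists only to keep the example self-contained; in a notebook you would more likely type the three relplot calls directly. Column names are assumed to match Seaborn's penguins table.

```python
import pandas as pd
import seaborn as sns


def three_quick_laps(df: pd.DataFrame) -> dict:
    """Three variations on one relplot call, each changing one argument.

    `three_quick_laps` is a hypothetical helper for illustration only.
    """
    # The part of the call that stays the same from lap to lap.
    base = dict(data=df, x="flipper_length_mm", y="body_mass_g")
    return {
        "plain": sns.relplot(**base),                    # lap 1: raw cloud
        "with_hue": sns.relplot(**base, hue="species"),  # lap 2: color by group
        "faceted": sns.relplot(**base, col="species"),   # lap 3: panel per group
    }


# Usage (assumes network access for the first download):
#   penguins = sns.load_dataset("penguins")
#   grids = three_quick_laps(penguins)
```

Each lap costs one keyword argument, which is what keeps the loop cheap enough to repeat many times.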

Exploring is not yet confirming

EDA is for generating hypotheses, not proving them. A pattern you spot by eye — especially one you went looking for after seeing the data — can be a real effect or a coincidence of this particular sample. Treat an exploratory finding as a promising lead to confirm with a proper test or fresh data, not as a settled conclusion.

Your turn

Challenge
Run step 1 on penguins

Practice the very first move of the loop: understanding a dataset's structure. Using the penguins dataset, compute two facts and store each in a variable:

  1. n_rows — the number of rows in the dataset (an integer).
  2. missing_sex — the number of missing values in the sex column (an integer).

Hint: df.shape is a (rows, columns) tuple, and df["col"].isna().sum() counts the missing values in a column.

Two quick facts — 344 rows, 11 missing sex values — and you already know more about this table than most people who plot it. That is step 1 doing its quiet, essential job before any chart is drawn.

Check your understanding

Question (select one)

Which phrase best describes the overall character of exploratory data analysis?

A one-time report you generate at the very end of a project.

An iterative loop of plotting, noticing something, asking a sharper question, and plotting again.

A fixed checklist of plots you must produce in a strict order every time.

The process of formally proving a hypothesis with a statistical test.

Question (select one)

You have just loaded a brand-new dataset you have never seen. According to the workflow on this page, which is the most sensible first move?

Build a polished, publication-ready multi-panel figure.

Immediately run a formal statistical test on two columns.

Inspect its structure — shape, a few rows, column types, and missing values — with tools like shape, head, info, and isna().sum().

Drop every row that contains any missing value, then start plotting.

Question (select one)

In the penguins walkthrough, what made the grouped scatter plot (colored by species) such a productive next step after the body-mass histogram?

It used a prettier color palette than the histogram.

It proved beyond doubt that species causes body mass.

It tested the histogram's hint of hidden groups by relating two variables and splitting by species, which explained the second peak.

It replaced the need to ever look at single variables again.

You now have a repeatable way to meet any dataset: check its structure, look at one variable, then two, then many, and let each plot sharpen the next question. With this workflow in hand and the tidy-data and grammar mental models behind you, you are ready to start drawing the chart families themselves.
