Dataslope

First Look at a Dataset

The five-minute ritual every analyst performs on a new dataset — and the questions to ask before you compute a single number.

You have a fresh DataFrame. The temptation is to dive straight into .mean() and charts. Resist it. Spend five minutes getting to know the dataset first. Those five minutes save hours later.

The five-minute ritual

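Every analyst eventually develops a personal version of this ritual. Here is a solid one to start with: (1) shape and head, (2) dtypes and memory, (3) describe, (4) a missing-value map, and (5) unique-value counts for categoricals.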

Let us walk each step on a real dataset.

Step 1 — Shape and head

The first question: what does the table look like, and how big is it?

What you are looking for:

  • Reasonable row count. Does 1,470 employees match what someone told you to expect? If you expected a million and got a thousand, something was filtered.
  • All the columns you expected. If you were promised manager_id and there is no such column, ask.
  • Sample values that make sense. If Age shows None, "N/A", or a wildly out-of-range value, you know cleaning is in your future.
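
The checks above can be sketched in a few lines. This is a minimal illustration on a tiny, made-up stand-in for the HR data (the five rows and the `expected_cols` set are assumptions for the example, not the real file):

```python
import pandas as pd

# Hypothetical stand-in for the HR data; the real file has far more rows and columns.
df = pd.DataFrame(
    {
        "Age": [41, 49, 37, 33, 27],
        "Department": [
            "Sales",
            "Research & Development",
            "Research & Development",
            "Sales",
            "Human Resources",
        ],
        "Attrition": ["Yes", "No", "Yes", "No", "No"],
    }
)

# How big is the table? df.shape is a (rows, columns) tuple.
rows, cols = df.shape
print(f"{rows} rows x {cols} columns")

# Compare the columns you were promised against the columns you got.
expected_cols = {"Age", "Department", "Attrition", "manager_id"}  # assumed expectation
missing_cols = expected_cols - set(df.columns)
print("Missing columns:", sorted(missing_cols))

# Eyeball the first rows, then a random sample so you also see beyond the top.
print(df.head(3))
print(df.sample(2, random_state=0))
```

Sampling in addition to `head()` is a design choice: files are often sorted, so the first rows are not always representative.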

Step 2 — Dtypes and memory

The second question: did Pandas interpret the columns correctly, and is the dataset going to fit in memory?

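df.info(memory_usage="deep") prints a summary of each column's name, non-null count, and dtype, plus the DataFrame's total memory usage. The "deep" option makes Pandas measure the actual contents of object (string) columns instead of estimating from the column's dtype and row count, so the figure is accurate but slower to compute.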

Look for:

  • Strings that should be numbers (e.g. "5,000" with a comma comes in as object).
  • Numbers that should be strings (ZIP codes, phone numbers).
  • Object columns that should be categorical (a small, repeating vocabulary like Department).
  • object columns that contain dates — they need pd.to_datetime.
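
Each of the four problems above has a standard fix. The sketch below uses a small invented table (the `Salary`, `ZipCode`, and `HireDate` columns and their values are assumptions for illustration) and shows one reasonable conversion per case:

```python
import pandas as pd

# Hypothetical raw data showing each of the four problems listed above.
raw = pd.DataFrame(
    {
        "Salary": ["5,000", "12,500", "7,250"],                  # numbers stored as text
        "ZipCode": [2139, 61820, 94105],                         # codes parsed as integers
        "Department": ["Sales", "Sales", "Human Resources"],     # small repeating vocabulary
        "HireDate": ["2019-03-01", "2020-11-15", "2021-07-30"],  # dates stored as text
    }
)

clean = raw.assign(
    # Strip the thousands separator, then convert to a numeric dtype.
    Salary=pd.to_numeric(raw["Salary"].str.replace(",", "", regex=False)),
    # Treat ZIP codes as text and restore the leading zero that integers drop.
    ZipCode=raw["ZipCode"].astype(str).str.zfill(5),
    # Store the repeating vocabulary as a categorical.
    Department=raw["Department"].astype("category"),
    # Parse the date strings with an explicit format.
    HireDate=pd.to_datetime(raw["HireDate"], format="%Y-%m-%d"),
)

print(raw.dtypes)
print(clean.dtypes)
clean.info(memory_usage="deep")
```

If the file is read with `pd.read_csv`, some of these fixes can instead be applied at load time, but converting after the fact keeps the first look and the cleanup as separate steps.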

Step 3 — Describe

describe gives you summary statistics for each column. By default it summarizes numeric columns; pass include="all" to also describe text/categorical columns.

For numeric columns, look at:

  • count — should match len(df) unless there are NaNs.
  • min / max — sanity check the range. Age of -3? Salary of $999,999,999? Bug.
  • mean vs 50% (median) — large differences signal skew or outliers.
  • std — very small std means the column barely varies and may not be useful.

For object columns:

  • unique — number of distinct values. Useful for spotting near-categoricals.
  • top — most common value.
  • freq — how many times the top value appears.
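
As a rough sketch of how to read these rows, the example below runs describe on a five-row invented table with one planted error (an Age of -3). The data and variable names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data; the -3 is a planted error for describe() to surface.
df = pd.DataFrame(
    {
        "Age": [29, 34, 41, 38, -3],
        "Department": [
            "Sales",
            "Sales",
            "Human Resources",
            "Sales",
            "Research & Development",
        ],
    }
)

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique / top / freq for text columns
print(full_summary)

# Range check: an impossible minimum is visible immediately.
age_min = numeric_summary.loc["min", "Age"]

# Mean vs median: a large gap hints at skew or outliers.
mean_median_gap = numeric_summary.loc["mean", "Age"] - numeric_summary.loc["50%", "Age"]

# Most common value of a text column and how often it appears.
top_department = full_summary.loc["top", "Department"]
top_freq = full_summary.loc["freq", "Department"]

print(age_min, mean_median_gap, top_department, top_freq)
```

Here the single bad value pulls the mean several years below the median, which is the kind of gap the bullet on mean vs 50% describes.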

Step 4 — Missing-value map

The fourth question: where is data missing?

The HR dataset is unusually clean, so you may see zero missing values. Most real datasets have several columns with some missingness, and the pattern matters:

  • Random missingness can often be ignored or imputed.
  • Structural missingness (a column that is only populated for some rows by design) tells you something about the dataset's grain.
  • Concentrated missingness (one row is mostly NaN) may indicate a bad row that should be dropped.
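
One way to see these patterns is to count missing cells by column and by row. The sketch below uses invented data; `TerminationDate` is a hypothetical column that is only populated for employees who left, and the 50% row threshold is an arbitrary choice for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data with three kinds of missingness.
df = pd.DataFrame(
    {
        "Age": [41, np.nan, 37, np.nan],
        "Department": ["Sales", "Sales", "Human Resources", None],
        # Structural: only filled in for employees who have left.
        "TerminationDate": [None, "2023-05-01", None, None],
    }
)

# Column view: how many cells are missing, and what share of each column.
missing_per_col = df.isna().sum()
missing_share = df.isna().mean().round(2)
print(pd.DataFrame({"n_missing": missing_per_col, "share": missing_share}))

# Row view: flag rows where more than half the cells are missing.
row_missing_share = df.isna().mean(axis=1)
suspect_rows = df.index[row_missing_share > 0.5].tolist()
print("Rows that are mostly empty:", suspect_rows)
```

The column view surfaces structural gaps such as the mostly empty `TerminationDate`, while the row view surfaces concentrated missingness.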

We have a whole chapter on this.

Step 5 — Unique-value counts for categoricals

For every text-like column, look at the values:

What to look for:

  • Casing inconsistencies: "Sales", "sales", "SALES".
  • Typos: "Engineering", "Enginering".
  • Near-duplicates: "USA", "U.S.", "United States".
  • Surprising rare values — a single row with department "???" is probably a data-entry mistake.
  • Cardinality — if every row has a unique value (like a name), this is an identifier column, not a category.
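
Several of these checks can be sketched with value_counts and nunique. The table below is invented (the `EmployeeName` column and all values are assumptions), and lowercasing is only one possible normalization:

```python
import pandas as pd

# Hypothetical data with casing problems, a typo, and a junk value.
df = pd.DataFrame(
    {
        "EmployeeName": ["Ana", "Ben", "Caro", "Dev", "Eli", "Fay", "Gus", "Hana"],
        "Department": [
            "Sales", "sales", "SALES", "Sales",
            "Engineering", "Enginering", "Engineering", "???",
        ],
    }
)

# Raw counts: casing variants show up as separate values.
print(df["Department"].value_counts(dropna=False))

# Lowercase before counting so casing variants collapse into one value.
normalized = df["Department"].str.lower()
counts = normalized.value_counts()
print(counts)

# Values that appear only once are candidates for typos or data-entry mistakes.
rare_values = counts[counts == 1].index.tolist()

# Columns where every row is distinct behave like identifiers, not categories.
id_like = [col for col in df.columns if df[col].nunique() == len(df)]

print("Rare values:", rare_values)
print("Identifier-like columns:", id_like)
```

Lowercasing fixes the casing variants but not the typo, which still needs a manual mapping once you have spotted it.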

A second-look ritual: pairwise interactions

After the five-minute ritual, do a few quick cross-tabulations.

These take a minute and often reveal the structure of the dataset (e.g., "Each JobRole belongs to exactly one department — so they're a hierarchy"). That kind of insight changes how you think about all subsequent analysis.
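
A cross-tabulation and a nesting check might look like the sketch below. The six rows are invented, so the result says nothing about the real HR file; the column names follow the ones mentioned in this chapter:

```python
import pandas as pd

# Hypothetical rows using the Department / JobRole / Attrition columns.
df = pd.DataFrame(
    {
        "Department": [
            "Sales", "Sales", "Research & Development",
            "Research & Development", "Human Resources", "Sales",
        ],
        "JobRole": [
            "Sales Executive", "Sales Representative", "Research Scientist",
            "Laboratory Technician", "Human Resources", "Sales Executive",
        ],
        "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    }
)

# Counts of each JobRole within each Department.
role_by_dept = pd.crosstab(df["JobRole"], df["Department"])
print(role_by_dept)

# Share of "Yes" / "No" within each Department (each row sums to 1).
attrition_by_dept = pd.crosstab(df["Department"], df["Attrition"], normalize="index")
print(attrition_by_dept)

# Nesting check: does every JobRole appear in exactly one Department?
depts_per_role = df.groupby("JobRole")["Department"].nunique()
is_nested = bool((depts_per_role == 1).all())
print("JobRole nested within Department:", is_nested)
```

If `is_nested` comes back False on real data, `depts_per_role[depts_per_role > 1]` lists the roles that span several departments.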

Document what you find

A common professional habit: keep a short Markdown note (in the notebook itself or in a separate file) capturing your findings from the five-minute ritual.

Dataset: HR-dataset-v14.csv. 1,470 rows × 35 cols. Each row is one employee. No missing values. Department has 3 values (R&D, Sales, HR). JobRole (9 values) is nested within Department. Attrition is "Yes" / "No", ~16% attrition overall.

The next person who picks up your notebook will thank you. The next person is often you in three months.
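
Part of such a note can be generated rather than typed. The helper below, `first_look_note`, is a hypothetical function written for this illustration (it is not part of Pandas), and the grain still has to be supplied by you because it cannot be inferred reliably from the data:

```python
import pandas as pd


def first_look_note(df: pd.DataFrame, name: str, grain: str) -> str:
    """Build a short Markdown summary of a DataFrame (hypothetical helper)."""
    n_missing = int(df.isna().sum().sum())  # missing cells across all columns
    lines = [
        f"**Dataset:** {name}",
        f"- Shape: {len(df):,} rows x {df.shape[1]} cols",
        f"- Grain: {grain}",
        f"- Missing cells: {n_missing}",
    ]
    return "\n".join(lines)


# Usage on a tiny invented table.
toy = pd.DataFrame({"Age": [41, None], "Department": ["Sales", "Sales"]})
note = first_look_note(toy, name="toy.csv", grain="one row per employee")
print(note)
```

Surprises and domain observations still belong in the note by hand; the generated lines only cover the mechanical facts.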

The single most under-rated skill

The analysts who get promoted are usually the ones who can quickly summarize "what is in this dataset?" in plain English. The technical skills come naturally with practice; the habit of looking before computing is much rarer.

A guided exploration challenge

Challenge
First-look summary

Load this HR dataset:

https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv

Then compute four values and assign them to specific variables:

  1. n_rows — total number of employees
  2. n_cols — total number of columns
  3. n_missing_total — total count of NaN cells across the whole DataFrame (sum over all columns)
  4. departments — a sorted list of unique values in the Department column

Use Pandas, not hard-coded numbers.

Check your understanding

Question (select one)

Why does the chapter recommend running df.describe() before any "real" analysis?

To make the dataset look bigger

Because it gives a quick read on ranges, central tendency, and spread — and surfaces obviously wrong values (negative ages, impossible maxes) early

It is required by Pandas

It removes missing values

Question (select one)

In the missing-value step, why is the pattern of missingness as important as the count?

It is not

Because random missingness can often be ignored or imputed, structural missingness reveals something about the dataset's grain, and concentrated missingness flags bad rows — each pattern suggests a different fix

It changes how Pandas computes means

Pandas requires you to look at the pattern

Question (select one)

A column with 1,470 rows shows 1,470 unique values when you call .value_counts(). What is it most likely to be?

A categorical variable

A boolean column

An identifier column (like an employee ID or name) — every row has a different value, so it is unique per row

A date column

Question (select one)

Which of these is the chapter's recommended habit after completing the five-minute ritual?

Delete the dataset

Switch to Excel

Write a short Markdown note summarizing what you learned (rows, grain, missingness, surprises) so you and others can find it later

Run all the visualizations
