Loading Datasets

Reading real CSV, Excel, JSON, and Parquet data into Pandas — including from URLs — and the most common pitfalls that hit you in the first five seconds.

This is where the hands-on work begins. Almost every analysis starts with the same gesture: read a dataset into a DataFrame. Pandas makes that easy in the happy path and adds knobs for the unhappy paths.

The two functions you will use most

pd.read_csv(path_or_url) — the workhorse. CSV is the universal data format.
pd.read_excel(path_or_url) — for .xlsx files.

Honorable mentions:

pd.read_json — JSON, often after some flattening.
pd.read_parquet — Parquet files (a fast columnar format).
pd.read_sql — query results from a SQL database.

For the rest of the course we will lean on read_csv because the data we will use lives on the public internet as CSVs.

Reading from a URL

pd.read_csv accepts a URL exactly like a file path. The bytes are fetched over HTTPS in the background and parsed.

Two notes:

Use the raw URL. Browsers will show GitHub HTML pages at github.com/.../file.csv — that is not the file itself. The raw version starts with raw.githubusercontent.com and returns the actual bytes.
Some hosts don't allow direct browser fetches. For those you can route through Dataslope's CORS proxy. We will see that further down.

Common `read_csv` options

pd.read_csv has dozens of parameters, but you will use only a handful in real life:

Parameter	What it does
`sep`	Custom delimiter (default `,`). Use `"\t"` for TSV.
`header`	Which row contains the header (default `0`).
`names`	Use these column names (skip header).
`usecols`	Only load these columns.
`dtype`	Force types (`{"zip": "string"}`).
`parse_dates`	Parse these columns as dates.
`na_values`	Treat these strings as missing (`["NA", "-"]`).
`skiprows`	Skip N rows at the top (or specific rows).
`nrows`	Only read N rows (for sampling huge files).
`encoding`	Text encoding (`"utf-8"`, `"latin-1"`, ...).

Let us try a real example with several of these.

Two things to notice:

usecols saves memory by skipping columns we do not need.
dtype with "category" for repeated string columns saves significant memory and speeds up groupings.

The first three things to do after `read_csv`

Whenever you load a new dataset, run these three commands before you do anything else.

If any of those three looks wrong — surprising row count, garbled values, wrong types — fix it now. The cost of fixing a loading bug is much lower at this moment than three steps later.

Loading from a non-CORS-friendly URL

Some hosts (including GitHub itself outside raw.githubusercontent .com, plus many private servers) do not include the HTTP headers that allow a browser to fetch them. Dataslope provides a CORS proxy you can route through:

When you have a choice, prefer the direct raw.githubuser content.com URL. The proxy is for cases where you do not.

Reading Excel files

.xlsx files often have multiple sheets and mixed content (headers offset by a few rows, merged cells, etc.). The basic call:

df = pd.read_excel("file.xlsx", sheet_name="Sales")

sheet_name accepts:

A string (sheet name).
An integer (sheet index, 0 = first).
A list of sheet names → returns a dict of DataFrames.
None → returns all sheets as a dict.

A note on reality: Excel files frequently have a row or two of title text at the top before the actual header row. Use the skiprows parameter to jump over them.

Reading JSON

JSON often comes from APIs and is hierarchical, not tabular. Pandas can flatten simple cases:

For deeply nested JSON, use pd.json_normalize(...) — it recursively flattens nested objects into columns with dot-separated names.

Defensive reading patterns

A few habits worth building from day one.

Always pin the types you care about

df = pd.read_csv(
    path,
    dtype={"zip": "string", "phone": "string"},
)

This protects you from the leading-zero / scientific-notation problems we saw earlier.

Parse dates explicitly

df = pd.read_csv(path, parse_dates=["order_date", "ship_date"])

Strings that look like dates are still strings until you parse them. Date arithmetic on strings does nothing useful.

Mark known missing-value sentinels

Many CSVs use "N/A", "-", "unknown" to mean missing. Tell Pandas:

df = pd.read_csv(path, na_values=["N/A", "-", "unknown"])

Sample huge files first

If a file is 5 GB, do not load it all to see what is in it. Load the first 1000 rows:

preview = pd.read_csv(path, nrows=1000)

Inspect, design your real load, then run it.

A loading challenge

Load the HR dataset from this URL:

https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv

But:

Only load the columns: Age, Department, MonthlyIncome, JobSatisfaction, Attrition.
Convert Department and Attrition to the category dtype as you load.

Store the result in a DataFrame called df.

Check your understanding

QuestionSelect one

What is the difference between https://github.com/.../file.csv and https://raw.githubusercontent.com/.../file.csv?

They are identical

The first is faster

The first returns the GitHub HTML page that displays the file; the second returns the file's raw bytes

The second requires authentication

QuestionSelect one

Why use the category dtype on a column like Department?

It looks prettier

Repeated string values are stored once with integer codes, which saves memory and speeds up grouping

It allows the column to hold any type

It is required by Pandas

QuestionSelect one

A CSV column phone contains values like 5551234567. After pd.read_csv you see it has dtype int64. What is the safest fix?

Restart your computer

Convert to float

Pass dtype={"phone": "string"} to pd.read_csv so the values are read as strings from the start

Use .astype("int64") afterwards

QuestionSelect one

When loading a 5 GB CSV that you have never seen before, what is the recommended first move?

Load the whole thing and wait

Open it in Excel

Use pd.read_csv(path, nrows=1000) to load a preview, then plan the real load (dtypes, usecols) based on what you see

Convert it to JSON first

Loading Datasets

On this page