Loading Datasets
Reading real CSV, Excel, JSON, and Parquet data into Pandas — including from URLs — and the most common pitfalls that hit you in the first five seconds.
This is where the hands-on work begins. Almost every analysis starts with the same gesture: read a dataset into a DataFrame. Pandas makes that easy in the happy path and adds knobs for the unhappy paths.
The two functions you will use most
pd.read_csv(path_or_url)— the workhorse. CSV is the universal data format.pd.read_excel(path_or_url)— for.xlsxfiles.
Honorable mentions:
pd.read_json— JSON, often after some flattening.pd.read_parquet— Parquet files (a fast columnar format).pd.read_sql— query results from a SQL database.
For the rest of the course we will lean on read_csv because
the data we will use lives on the public internet as CSVs.
Reading from a URL
pd.read_csv accepts a URL exactly like a file path. The bytes
are fetched over HTTPS in the background and parsed.
Two notes:
- Use the raw URL. Browsers will show GitHub HTML pages at
github.com/.../file.csv— that is not the file itself. The raw version starts withraw.githubusercontent.comand returns the actual bytes. - Some hosts don't allow direct browser fetches. For those you can route through Dataslope's CORS proxy. We will see that further down.
Common read_csv options
pd.read_csv has dozens of parameters, but you will use only a
handful in real life:
| Parameter | What it does |
|---|---|
sep | Custom delimiter (default ,). Use "\t" for TSV. |
header | Which row contains the header (default 0). |
names | Use these column names (skip header). |
usecols | Only load these columns. |
dtype | Force types ({"zip": "string"}). |
parse_dates | Parse these columns as dates. |
na_values | Treat these strings as missing (["NA", "-"]). |
skiprows | Skip N rows at the top (or specific rows). |
nrows | Only read N rows (for sampling huge files). |
encoding | Text encoding ("utf-8", "latin-1", ...). |
Let us try a real example with several of these.
Two things to notice:
usecolssaves memory by skipping columns we do not need.dtypewith"category"for repeated string columns saves significant memory and speeds up groupings.
The first three things to do after read_csv
Whenever you load a new dataset, run these three commands before you do anything else.
If any of those three looks wrong — surprising row count, garbled values, wrong types — fix it now. The cost of fixing a loading bug is much lower at this moment than three steps later.
Loading from a non-CORS-friendly URL
Some hosts (including GitHub itself outside raw.githubusercontent .com, plus many private servers) do not include the HTTP
headers that allow a browser to fetch them. Dataslope provides a
CORS proxy you can route through:
When you have a choice, prefer the direct raw.githubuser content.com URL. The proxy is for cases where you do not.
Reading Excel files
.xlsx files often have multiple sheets and mixed content
(headers offset by a few rows, merged cells, etc.). The basic
call:
df = pd.read_excel("file.xlsx", sheet_name="Sales")sheet_name accepts:
- A string (sheet name).
- An integer (sheet index, 0 = first).
- A list of sheet names → returns a dict of DataFrames.
None→ returns all sheets as a dict.
A note on reality: Excel files frequently have a row or two of
title text at the top before the actual header row. Use the
skiprows parameter to jump over them.
Reading JSON
JSON often comes from APIs and is hierarchical, not tabular. Pandas can flatten simple cases:
For deeply nested JSON, use pd.json_normalize(...) — it
recursively flattens nested objects into columns with
dot-separated names.
Defensive reading patterns
A few habits worth building from day one.
Always pin the types you care about
df = pd.read_csv(
path,
dtype={"zip": "string", "phone": "string"},
)This protects you from the leading-zero / scientific-notation problems we saw earlier.
Parse dates explicitly
df = pd.read_csv(path, parse_dates=["order_date", "ship_date"])Strings that look like dates are still strings until you parse them. Date arithmetic on strings does nothing useful.
Mark known missing-value sentinels
Many CSVs use "N/A", "-", "unknown" to mean missing.
Tell Pandas:
df = pd.read_csv(path, na_values=["N/A", "-", "unknown"])Sample huge files first
If a file is 5 GB, do not load it all to see what is in it. Load the first 1000 rows:
preview = pd.read_csv(path, nrows=1000)Inspect, design your real load, then run it.
A loading challenge
Load the HR dataset from this URL:
https://raw.githubusercontent.com/bdi475/datasets/main/HR-dataset-v14.csv
But:
- Only load the columns:
Age,Department,MonthlyIncome,JobSatisfaction,Attrition. - Convert
DepartmentandAttritionto thecategorydtype as you load.
Store the result in a DataFrame called df.
Check your understanding
What is the difference between https://github.com/.../file.csv and https://raw.githubusercontent.com/.../file.csv?
They are identical
The first is faster
The first returns the GitHub HTML page that displays the file; the second returns the file's raw bytes
The second requires authentication
Why use the category dtype on a column like Department?
It looks prettier
Repeated string values are stored once with integer codes, which saves memory and speeds up grouping
It allows the column to hold any type
It is required by Pandas
A CSV column phone contains values like 5551234567. After pd.read_csv you see it has dtype int64. What is the safest fix?
Restart your computer
Convert to float
Pass dtype={"phone": "string"} to pd.read_csv so the values are read as strings from the start
Use .astype("int64") afterwards
When loading a 5 GB CSV that you have never seen before, what is the recommended first move?
Load the whole thing and wait
Open it in Excel
Use pd.read_csv(path, nrows=1000) to load a preview, then plan the real load (dtypes, usecols) based on what you see
Convert it to JSON first
Notebooks and Environments
Conceptually understanding scripts vs notebooks, virtual environments, packages, and reproducibility — even though Dataslope handles the setup for you.
First Look at a Dataset
The five-minute ritual every analyst performs on a new dataset — and the questions to ask before you compute a single number.