How Computers See Data
A gentle look under the hood — bits, bytes, files, types, and the long chain that turns a CSV on disk into rows and columns in Pandas.
You can use Pandas productively for years without understanding how the bits get from disk to memory. But a little mental model of what is going on underneath will save you from mysterious bugs — why is this column a string when I expected a number? — and it will make the rest of the course feel less like magic.
A computer only knows numbers
At the lowest level, every byte your computer stores is a number between 0 and 255. There are no strings, no dates, no booleans, no DataFrames — just bytes.
The trick is interpretation. The same eight bits can mean a small integer, a character of text, a fraction of a pixel color, or a step in an audio waveform. The interpretation is imposed by the program that reads them.
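A minimal sketch of that idea, using only the Python standard library. The two byte values are chosen arbitrarily for illustration; the point is that the same bytes come out as numbers, text, or a single larger integer depending on how the program chooses to read them.

```python
import struct

raw = bytes([72, 105])  # two bytes holding the numbers 72 and 105

as_numbers = list(raw)                   # read as two small integers
as_text = raw.decode("ascii")            # read as two characters of text
as_uint16 = struct.unpack("<H", raw)[0]  # read as one little-endian 16-bit integer

print(as_numbers)  # [72, 105]
print(as_text)     # Hi
print(as_uint16)   # 26952
```

Nothing in the bytes themselves says which reading is correct. That decision lives in the code doing the reading.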
Files on disk are long sequences of bytes plus a tiny bit of metadata saying what kind of bytes they are (the file extension, sometimes a "magic number" at the start). The sequence does not know it is a CSV until something reads it as one.
A CSV is plain text with commas
The simplest data format in the world: each line is a row, each value is separated by a comma, the first line is often a header.
name,age,city
Aiko,29,Tokyo
Bilal,42,Karachi
Chen,35,Shanghai
That is literally a complete dataset. You could write it in Notepad. Open it in Excel and it looks like a table. Open it in Pandas and it looks like a DataFrame. The bytes on disk are identical.
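To see the bytes for yourself, the sketch below writes the same three-row dataset to a temporary file and reads it back in binary mode. The file name `people.csv` and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

csv_text = "name,age,city\nAiko,29,Tokyo\nBilal,42,Karachi\nChen,35,Shanghai\n"

# Write the text to a file in a temporary directory (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(csv_text)

# Read it back as raw bytes, with no text decoding at all.
with open(path, "rb") as f:
    raw = f.read()

print(raw[:14])       # b'name,age,city\n'
print(list(raw[:4]))  # [110, 97, 109, 101], the byte values for "name"
```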
Why CSV is everywhere
CSV's strengths: human-readable, tiny dependency footprint, portable across every operating system and language since 1972. Its weaknesses: no type information (everything is text until a reader guesses), bad with commas inside values (which need quoting), and no schema. Despite all of that, it is the most widely exchanged tabular data format in the world.
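The quoting weakness is easy to demonstrate with Python's built-in `csv` module. The sketch below uses a made-up row whose city value contains a comma; a proper CSV writer wraps it in quotes, and a naive split on commas gets the row wrong.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "city"])
writer.writerow(["Dana", "Portland, Oregon"])  # hypothetical row with a comma inside a value

text = buffer.getvalue()
print(text)  # the second row is written as: Dana,"Portland, Oregon"

# A CSV-aware reader respects the quotes and recovers two cells.
rows = list(csv.reader(io.StringIO(text)))
print(rows[1])  # ['Dana', 'Portland, Oregon']

# Splitting on commas by hand produces three broken pieces instead.
naive = text.splitlines()[1].split(",")
print(len(naive))  # 3
```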
Types — what a byte "is"
When Pandas reads the CSV above, it does not just store characters. It looks at each column and guesses a type:
- name → text (string).
- age → small integer.
- city → text.
This is called type inference, and it is the single most common source of frustration for new Pandas users. Pandas guesses well most of the time, but when it guesses wrong, silent bugs follow.
The main types you will encounter:
| Pandas dtype | What it represents | Example values |
|---|---|---|
| int64 | Whole numbers | 1, 42, -7 |
| float64 | Decimal numbers (and NaNs) | 3.14, 1.5e9, NaN |
| bool | True/False | True, False |
| object | Strings, or mixed types | "Aiko", "Tokyo" |
| datetime64 | Timestamps | 2024-03-18 14:00:00 |
| category | A fixed set of repeated labels | "East", "West" |
Let us see them in action.
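The original code cell is not reproduced here, so what follows is a stand-in sketch. It extends the small dataset from earlier with made-up `salary`, `active`, and `joined` columns (all assumptions for illustration) and asks Pandas what it inferred.

```python
import io

import pandas as pd

# Inline CSV text standing in for a file on disk; the extra columns are invented.
csv_text = """name,age,salary,active,joined
Aiko,29,52000.5,True,2021-04-01
Bilal,42,61000.0,False,2019-11-15
Chen,35,58000.25,True,2020-07-30
"""

df = pd.read_csv(io.StringIO(csv_text))

# One dtype per column: age is int64, salary is float64, active is bool.
# name and joined are text (reported as object in pandas 2.x; newer
# versions may report a dedicated string dtype instead).
print(df.dtypes)
```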
Notice that joined came in as object (a plain string), not
as a date. Pandas is conservative — it does not assume a string
that looks like a date is a date. You can fix that:
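One common way to do it is `pd.to_datetime`. The sketch below rebuilds a small frame with a text `joined` column (the dates are invented for the example) and converts it, passing the format explicitly so nothing is left to guesswork.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Aiko", "Bilal", "Chen"],
        "joined": ["2021-04-01", "2019-11-15", "2020-07-30"],  # still plain text
    }
)

# Convert the text column to real timestamps, stating the format explicitly.
df["joined"] = pd.to_datetime(df["joined"], format="%Y-%m-%d")
print(df["joined"].dtype)

# Date arithmetic now works: days between the earliest and latest join date.
days_between = (df["joined"].max() - df["joined"].min()).days
print(days_between)  # 503
```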
Now joined is datetime64[ns] and you can do date arithmetic
on it — sort chronologically, filter by year, compute the number
of days since onboarding. We will return to dates and times in
their own chapter.
Memory layout: rows or columns?
Spreadsheets display data in rows and columns, but most modern analytical systems — including Pandas — store data column by column in memory. Why?
- All values in a column have the same type (all integers, all strings, all dates). That means they can be packed tightly and processed quickly.
- Most analytical queries touch a few columns of many rows (mean of salary, count of department), not many columns of a few rows. Column storage makes those queries fast.
This is one of the reasons Pandas is fast on operations like
df["age"].mean() — it can stride through one contiguous
chunk of memory and skip the others entirely.
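The difference between the two layouts can be sketched in plain Python, without Pandas. This is only an illustration of the idea, not how Pandas is implemented internally, and the salary figures are made up.

```python
from array import array

# Row-oriented: one record per row, every record mixes types.
rows = [
    {"name": "Aiko", "age": 29, "salary": 52000.0},
    {"name": "Bilal", "age": 42, "salary": 61000.0},
    {"name": "Chen", "age": 35, "salary": 58000.0},
]

# Column-oriented: one tightly packed, single-type sequence per column.
columns = {
    "name": ["Aiko", "Bilal", "Chen"],
    "age": array("q", [29, 42, 35]),                  # packed 64-bit integers
    "salary": array("d", [52000.0, 61000.0, 58000.0]),  # packed 64-bit floats
}

# Row layout: visit every record just to pull out one field.
mean_age_rows = sum(record["age"] for record in rows) / len(rows)

# Column layout: read one packed sequence and ignore the others.
mean_age_cols = sum(columns["age"]) / len(columns["age"])

print(mean_age_rows, mean_age_cols)
```

Both approaches give the same answer; the column layout simply reaches it without touching names or salaries at all.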
The journey of pd.read_csv("orders.csv")
Walking through what happens when you call this seemingly trivial line:
1. Locate the file. Pandas asks the operating system for the bytes at the given path (or, with a URL, asks the network library to fetch them over HTTPS).
2. Read bytes. A buffer of raw bytes is loaded into RAM.
3. Decode text. The bytes are interpreted as text using an encoding (usually UTF-8). If the encoding is wrong, you get the famous 'utf-8' codec can't decode byte errors.
4. Split into lines and cells. Pandas's CSV parser walks through the buffer, splitting on newlines (rows) and commas (cells), respecting quotes.
5. Infer types. For each column, it examines the parsed values and picks a dtype that fits them.
6. Allocate columns. Pandas reserves a contiguous block of memory per column, in the inferred type.
7. Fill columns. Each parsed value is written into its column's memory at the right row index.
8. Build the DataFrame. Wrap the columns with an Index (the row labels) and a list of column names. Return.
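The steps above can be mimicked with a toy reader built on the standard library. To be clear, this is a teaching sketch and not how Pandas's parser is actually written: the names `infer_type` and `toy_read_csv` are hypothetical, the type rules are deliberately crude (int, then float, else text), and the result is a plain dict of lists rather than a DataFrame.

```python
import csv
import io


def infer_type(values):
    """Return int if every value parses as int, else float, else str."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return str


def toy_read_csv(raw, encoding="utf-8"):
    text = raw.decode(encoding)                  # decode bytes into text
    rows = list(csv.reader(io.StringIO(text)))   # split into lines and cells
    header, records = rows[0], rows[1:]
    columns = {}
    for i, name in enumerate(header):
        cells = [record[i] for record in records]
        caster = infer_type(cells)               # infer one type per column
        columns[name] = [caster(cell) for cell in cells]  # fill the column
    return columns                               # stand-in for the DataFrame


raw = b"name,age,city\nAiko,29,Tokyo\nBilal,42,Karachi\nChen,35,Shanghai\n"
table = toy_read_csv(raw)
print(table["age"])   # [29, 42, 35]
print(table["city"])  # ['Tokyo', 'Karachi', 'Shanghai']
```

Even this toy version reproduces a classic surprise: `toy_read_csv(b"zip\n04210\n")` returns the integer 4210, because the inference step sees only digits.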
For a 1MB file this takes a few tens of milliseconds. For a 1GB file it takes a few seconds. Pandas's CSV parser is one of the most heavily optimized pieces of code in the library because everyone calls it.
What can go wrong in step 5
Type inference is where almost all Pandas surprises originate:
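- A column with mostly numbers and one stray "unknown" gets inferred as object (string), because one non-number was found. (Common missing-value markers such as "N/A" or "NA" are treated differently: they become NaN and the column comes back as float64.)
- A ZIP code column with values like 04210 gets inferred as int64, stripping the leading zero forever.
- A column of dates written as DD/MM/YYYY but parsed with the default month-first assumption silently swaps month and day for ambiguous values.
- A boolean column written as "yes"/"no" becomes object, not bool. By default only spellings such as "True"/"true" and "False"/"false" are recognized as booleans.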
We will cover defensive reading in the Loading Datasets
chapter. For now, the lesson is just: always check df.dtypes
after reading a new dataset. It is two seconds of work that
saves hours later.
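Here is a small sketch of that habit in practice. The `order_id` and `amount` columns and the stray "unknown" value are invented for the example; `pd.to_numeric` with `errors="coerce"` is one possible repair, shown as an option rather than the recommended fix for every case.

```python
import io

import pandas as pd

# One stray non-numeric token hides among the amounts.
csv_text = "order_id,amount\n1,19.99\n2,unknown\n3,5.00\n"
df = pd.read_csv(io.StringIO(csv_text))

# The two-second check: amount did not come back as a number.
print(df.dtypes)

# One possible repair: force the column to numeric, turning bad values into NaN.
amount = pd.to_numeric(df["amount"], errors="coerce")
print(amount.isna().sum())  # 1 value could not be parsed
```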
A quick experiment
Let us deliberately confuse Pandas and watch it happen.
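The original code cell is missing here, so this is a reconstruction sketch using the ZIP code case. The `customer` and `zip` column names and the values are assumptions made for the example.

```python
import io

import pandas as pd

# ZIP codes with leading zeros, stored as text in the CSV.
csv_text = "customer,zip\nAiko,04210\nBilal,90210\nChen,02134\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df["zip"].dtype)     # int64: every value looked like a whole number
print(df["zip"].tolist())  # [4210, 90210, 2134]
```

No warning and no error: two of the three ZIP codes are now wrong.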
To prevent this you can tell Pandas the type explicitly:
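A sketch of that, using the `dtype` parameter of `pd.read_csv` on an invented ZIP code column (the `customer` and `zip` names are assumptions for the example):

```python
import io

import pandas as pd

csv_text = "customer,zip\nAiko,04210\nBilal,90210\nChen,02134\n"

# Tell the reader up front that zip is text, so no inference happens for it.
df = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})

print(df["zip"].tolist())  # ['04210', '90210', '02134']
```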
This is a real bug pattern in production analytics — the kind of thing that costs careers when a "data cleaning" step quietly corrupts the source data.
Memory and your laptop
A 1-million-row DataFrame with 10 numeric columns is about 80
MB in memory (8 bytes per float64 × 10 columns × 1M rows). A
typical laptop has 8–32 GB of RAM, so this is small. But:
- A 100M-row × 50-column dataset is ~40 GB — too big.
- Strings (object dtype) take much more memory per value than numbers because each string is a separate Python object.
- Lots of duplication in a string column can be collapsed by converting to category dtype.
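The arithmetic above can be checked with `memory_usage`. The sketch below uses 100,000 rows rather than a million to keep it light, and an invented column of four repeated region labels to compare object and category storage.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 100_000, 10

# Ten float64 columns: 8 bytes x 10 columns x 100,000 rows.
numeric = pd.DataFrame(np.zeros((n_rows, n_cols)))
numeric_bytes = int(numeric.memory_usage(index=False).sum())
print(numeric_bytes)  # 8000000; scale by 10 for a million rows (about 80 MB)

# The same 100,000 labels stored as Python strings, then as a category.
labels = pd.Series(["East", "West", "North", "South"] * 25_000, dtype=object)
object_bytes = int(labels.memory_usage(index=False, deep=True))
category_bytes = int(labels.astype("category").memory_usage(index=False, deep=True))

print(object_bytes, category_bytes)  # the category version is far smaller
```

Exact byte counts for the string column depend on the Python and Pandas versions, which is why only the comparison is worth relying on.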
We will look at memory in detail later. For now, just know that Pandas lives entirely in RAM, which is both its biggest strength (instant access, no I/O) and its biggest limitation (your dataset must fit).
Check your understanding
A CSV file on disk is, at the lowest level:
- A spreadsheet
- A binary database table
- A sequence of bytes (typically interpreted as text) with rows separated by newlines and values separated by commas
- A proprietary Pandas format
Why might Pandas read a ZIP code column like 04210 as the integer 4210?
- Pandas has a bug
- The CSV file is corrupted
- Pandas inferred the column as int64 (because all values look like integers) and integers cannot have leading zeros, so the 0 is dropped silently
- ZIP codes are not real data
Why does Pandas store data column-by-column rather than row-by-row?
- It uses less disk space
- It is required by Python
- All values in a column share a type, which allows tight packing and fast operations; analytical queries also tend to scan a few columns of many rows
- Pandas was invented after rows existed
A column containing the values "yes", "no", "yes" is loaded by Pandas. What dtype is it most likely to get by default?
- bool
- int64
- object (Python strings)
- category