Dataslope

How Computers See Data

A gentle look under the hood — bits, bytes, files, types, and the long chain that turns a CSV on disk into rows and columns in Pandas.

You can use Pandas productively for years without understanding how the bits get from disk to memory. But a little mental model of what is going on underneath will save you from mysterious bugs — why is this column a string when I expected a number? — and it will make the rest of the course feel less like magic.

A computer only knows numbers

At the lowest level, every byte your computer stores is a number between 0 and 255. There are no strings, no dates, no booleans, no DataFrames — just bytes.

The trick is interpretation. The same eight bits can mean a small integer, a character of text, one color channel of a pixel, or one sample of an audio waveform. The interpretation is imposed by the program that reads them.
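
To make this concrete, here is a minimal sketch that reads the same four bytes in four different ways. The byte values are arbitrary and chosen only for illustration.

```python
import struct

raw = bytes([72, 105, 33, 0])  # four bytes, each a number from 0 to 255

# Interpretation 1: four small unsigned integers.
as_ints = list(raw)

# Interpretation 2: ASCII text (ignoring the trailing zero byte).
as_text = raw[:3].decode("ascii")

# Interpretation 3: one 32-bit little-endian unsigned integer.
(as_uint32,) = struct.unpack("<I", raw)

# Interpretation 4: an RGBA pixel, one byte per color channel.
red, green, blue, alpha = raw

print(as_ints)    # [72, 105, 33, 0]
print(as_text)    # Hi!
print(as_uint32)  # 2189640
print((red, green, blue, alpha))
```

Nothing in the bytes themselves says which reading is "right"; the reading program decides.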

Files on disk are long sequences of bytes plus a tiny bit of metadata saying what kind of bytes they are (the file extension, sometimes a "magic number" at the start). The sequence does not know it is a CSV until something reads it as one.

A CSV is plain text with commas

CSV is about as simple as a data format gets: each line is a row, the values in a row are separated by commas, and the first line is often a header.

name,age,city
Aiko,29,Tokyo
Bilal,42,Karachi
Chen,35,Shanghai

That is literally a complete dataset. You could write it in Notepad. Open it in Excel and it looks like a table. Open it in Pandas and it looks like a DataFrame. The bytes on disk are identical.
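
A small sketch of that idea, using only the standard library: the same text is turned into bytes (what a disk would hold) and then read back as CSV. Note that a plain CSV reader hands back every cell as text.

```python
import csv
import io

csv_text = "name,age,city\nAiko,29,Tokyo\nBilal,42,Karachi\nChen,35,Shanghai\n"

# What would be stored on disk: the text encoded as UTF-8 bytes.
csv_bytes = csv_text.encode("utf-8")
print(len(csv_bytes), "bytes; first five as numbers:", list(csv_bytes[:5]))

# Reading the bytes back as CSV is purely a matter of interpretation.
rows = list(csv.reader(io.StringIO(csv_bytes.decode("utf-8"))))
header, records = rows[0], rows[1:]
print(header)      # ['name', 'age', 'city']
print(records[0])  # ['Aiko', '29', 'Tokyo']  ('29' is still text here)
```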

Why CSV is everywhere

CSV's strengths: it is human-readable, needs no special software to read or write, and is portable across every operating system and programming language; comma-separated formats date back to 1972. Its weaknesses: it carries no type information (everything is text until a reader guesses), it handles commas inside values awkwardly (they must be quoted), and it has no schema. Despite all of that, it remains one of the most widely exchanged tabular data formats in the world.
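
The quoting problem is easy to see with Python's built-in csv module. This is a minimal sketch with a made-up row whose last value contains a comma.

```python
import csv
import io

# A hypothetical row; the third value itself contains a comma.
row = ["Dana", "41", "Portland, Oregon"]

buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line.strip())  # Dana,41,"Portland, Oregon"

# Naive splitting breaks the value into two cells; a CSV-aware reader does not.
naive = line.strip().split(",")
parsed = next(csv.reader(io.StringIO(line)))
print(len(naive), len(parsed))  # 4 3
```

This is why splitting lines on commas by hand is rarely safe for real-world CSV files.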

Types — what a byte "is"

When Pandas reads the CSV above, it does not just store characters. It looks at each column and guesses a type:

  • name → text (string).
  • age → integer (int64).
  • city → text.

This is called type inference, and it is the single most common source of frustration for new Pandas users. Pandas guesses well most of the time, but when it guesses wrong, silent bugs follow.
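
Here is a minimal sketch of type inference on the three-row CSV from earlier, read from an in-memory string rather than a file. Depending on the pandas version, text columns may be reported as object or as a dedicated string dtype.

```python
import io
import pandas as pd

csv_text = """name,age,city
Aiko,29,Tokyo
Bilal,42,Karachi
Chen,35,Shanghai
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # name and city are text, age is int64

# Because age was inferred as a 64-bit integer, arithmetic works directly.
print(df["age"].mean())
```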

The main types you will encounter:

| Pandas dtype | What it represents | Example values |
| --- | --- | --- |
| int64 | Whole numbers | 1, 42, -7 |
| float64 | Decimal numbers (and NaNs) | 3.14, 1.5e9, NaN |
| bool | True/False | True, False |
| object | Strings, or mixed types | "Aiko", "Tokyo" |
| datetime64 | Timestamps | 2024-03-18 14:00:00 |
| category | A fixed set of repeated labels | "East", "West" |

Let us see them in action.
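
A minimal sketch of such an example, assuming a small inline CSV. The column names (name, age, salary, active, joined) and all values are hypothetical, chosen to exercise several dtypes at once.

```python
import io
import pandas as pd

# Hypothetical sample data, one column per dtype of interest.
csv_text = """name,age,salary,active,joined
Aiko,29,5200.50,True,2021-04-01
Bilal,42,6100.00,False,2019-11-15
Chen,35,4800.25,True,2023-01-09
"""

df = pd.read_csv(io.StringIO(csv_text))

# One dtype per column: integers, floats, booleans, and text.
print(df.dtypes)
```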

Notice that joined came in as object (a plain string), not as a date. Pandas is conservative — it does not assume a string that looks like a date is a date. You can fix that:
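
A minimal sketch of two common ways to do it, assuming a hypothetical CSV with a joined column in YYYY-MM-DD format:

```python
import io
import pandas as pd

# Hypothetical sample data with an ISO-formatted date column.
csv_text = """name,joined
Aiko,2021-04-01
Bilal,2019-11-15
Chen,2023-01-09
"""

df = pd.read_csv(io.StringIO(csv_text))

# Option 1: convert after loading, stating the format explicitly.
df["joined"] = pd.to_datetime(df["joined"], format="%Y-%m-%d")

# Option 2: ask read_csv to parse the column while loading.
df2 = pd.read_csv(io.StringIO(csv_text), parse_dates=["joined"])

print(df.dtypes)
print(df["joined"].dt.year.tolist())  # [2021, 2019, 2023]
```

Stating the format explicitly is the more defensive choice, because it removes any guessing about which part of the string is the day and which is the month.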

Now joined is datetime64[ns] and you can do date arithmetic on it — sort chronologically, filter by year, compute the number of days since onboarding. We will return to dates and times in their own chapter.

Memory layout: rows or columns?

Spreadsheets display data in rows and columns, but most modern analytical systems — including Pandas — store data column by column in memory. Why?

  • All values in a column have the same type (all integers, all strings, all dates). That means they can be packed tightly and processed quickly.
  • Most analytical queries touch a few columns of many rows (mean of salary, count of department), not many columns of a few rows. Column storage makes those queries fast.

This is one of the reasons Pandas is fast on operations like df["age"].mean() — it can stride through one contiguous chunk of memory and skip the others entirely.
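
A small sketch of what "one typed array per column" means in practice. The frame below is synthetic, and the exact internal layout is a pandas implementation detail; the point is only that each column has a single dtype and a fixed size per value.

```python
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "age": np.arange(n, dtype="int64") % 90,
    "salary": np.linspace(30_000, 120_000, n),  # float64
})

# Each column is backed by a typed array with a fixed item size.
ages = df["age"].to_numpy()
print(ages.dtype, ages.itemsize, ages.nbytes)  # int64 8 8000000

# The mean only needs the "age" values; "salary" is never read.
print(df["age"].mean())
```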

The journey of pd.read_csv("orders.csv")

Here is what happens when you call this seemingly trivial line:

  1. Locate the file. Pandas asks the operating system for the bytes at the given path (or, with a URL, asks the network library to fetch them over HTTPS).
  2. Read bytes. A buffer of raw bytes is loaded into RAM.
  3. Decode text. The bytes are interpreted as text using an encoding (usually UTF-8). If the encoding is wrong, you get the famous 'utf-8' codec can't decode byte errors.
  4. Split into lines and cells. Pandas's CSV parser walks through the buffer, splitting on newlines (rows) and commas (cells), respecting quotes.
  5. Infer types. For each column, it examines the parsed values and picks the narrowest dtype that fits all of them.
  6. Allocate columns. Pandas reserves a contiguous block of memory per column, in the inferred type.
  7. Fill columns. Each parsed value is written into its column's memory at the right row index.
  8. Build the DataFrame. Wrap the columns with an Index (the row labels) and a list of column names. Return.
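
The steps above can be sketched in plain Python. This is a toy illustration, not how pandas actually implements its parser (which is written in optimized C); the helper infer_and_convert and the contents standing in for orders.csv are hypothetical.

```python
import csv
import io

# Steps 1-2: pretend these bytes were just read from orders.csv into RAM.
raw = b"order_id,amount,customer\n1,19.99,Aiko\n2,5.00,Bilal\n3,42.50,Chen\n"

# Step 3: decode bytes to text.
text = raw.decode("utf-8")

# Step 4: split into lines and cells, respecting quotes.
rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]


def infer_and_convert(values):
    """Steps 5-7, simplified: try int, then float, else keep text."""
    for caster in (int, float):
        try:
            return [caster(v) for v in values]
        except ValueError:
            continue
    return list(values)


# Transpose rows into columns, then convert each column as a whole.
columns = {
    name: infer_and_convert(cells)
    for name, cells in zip(header, zip(*records))
}

# Step 8 would wrap these columns in a DataFrame; a dict of lists stands in.
print(columns)

# What step 3 looks like when the encoding is wrong (Latin-1 bytes read as UTF-8):
try:
    b"caf\xe9".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```

Notice that a single unconvertible value is enough to push a whole column down to the next, less specific type.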

For a 1 MB file this typically takes a few tens of milliseconds. For a 1 GB file it is more like several seconds to tens of seconds, depending on the column types and your hardware. Pandas's CSV parser is one of the most heavily optimized pieces of code in the library because nearly everyone calls it.

What can go wrong in step 5

Type inference is where almost all Pandas surprises originate:

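  • A column with mostly numbers and one stray non-numeric value such as "unknown" gets inferred as object (string), because one non-number was found. (Common missing-value markers such as "N/A" and "NA" are recognized by default and become NaN instead, which turns an integer column into float64.)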
  • A ZIP code column with values like 04210 gets inferred as int64, stripping the leading zero forever.
  • A column of DD/MM/YYYY dates parsed with the default month-first assumption silently swaps month and day wherever both readings are valid (03/04/2024 becomes March 4, not April 3).
  • A boolean column written as "yes"/"no" becomes object, not bool. By default, read_csv only recognizes spellings such as True/true/TRUE and False/false/FALSE as booleans.
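
Several of these cases can be reproduced in a few lines. This is a minimal sketch with made-up column names and values. One detail worth knowing: the literal token N/A is on pandas's default list of missing-value markers, so on its own it produces NaN and a float64 column, while an unrecognized token such as "unknown" forces the column to text.

```python
import io
import pandas as pd

# "N/A" is a default missing-value marker: the column stays numeric.
with_na = pd.read_csv(io.StringIO("score\n10\nN/A\n30\n"))
print(with_na["score"].dtype)  # float64

# An unrecognized token forces the whole column to text.
with_junk = pd.read_csv(io.StringIO("score\n10\nunknown\n30\n"))
print(pd.api.types.is_numeric_dtype(with_junk["score"]))  # False

# "yes"/"no" are not treated as booleans by default.
flags = pd.read_csv(io.StringIO("flag\nyes\nno\nyes\n"))
print(pd.api.types.is_bool_dtype(flags["flag"]))  # False

# The same ambiguous date string, read month-first (default) and day-first.
print(pd.to_datetime("03/04/2024"))                 # 2024-03-04 00:00:00
print(pd.to_datetime("03/04/2024", dayfirst=True))  # 2024-04-03 00:00:00
```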

We will cover defensive reading in the Loading Datasets chapter. For now, the lesson is just: always check df.dtypes after reading a new dataset. It is two seconds of work that saves hours later.
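
A sketch of that habit, plus a slightly stricter variant that fails fast. The expected mapping and the sample columns are hypothetical; the checker functions come from pandas.api.types.

```python
import io
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical orders data with a ZIP code column.
csv_text = "order_id,amount,zip\n1,19.99,04210\n2,5.00,10001\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)  # the two-second check

# Stricter: state what each column should be and list the ones that differ.
expected = {
    "order_id": ptypes.is_integer_dtype,
    "amount": ptypes.is_float_dtype,
    "zip": ptypes.is_string_dtype,  # ZIP codes should be text
}
problems = [col for col, check in expected.items() if not check(df[col])]
print(problems)  # ['zip']
```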

A quick experiment

Let us deliberately confuse Pandas and watch it happen.
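
A minimal sketch of the experiment, assuming a small inline CSV with hypothetical city and zip columns:

```python
import io
import pandas as pd

# Hypothetical data: two of the three ZIP codes start with a zero.
csv_text = """city,zip
Auburn,04210
Boston,02108
Chicago,60601
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
print(df["zip"].tolist())  # [4210, 2108, 60601]
```

Every value looked like an integer, so the column became int64 and the leading zeros were lost without any warning.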

To prevent this you can tell Pandas the type explicitly:
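
A minimal sketch using the dtype parameter of read_csv, with the same hypothetical city and zip columns:

```python
import io
import pandas as pd

# Hypothetical data: two of the three ZIP codes start with a zero.
csv_text = """city,zip
Auburn,04210
Boston,02108
Chicago,60601
"""

# Tell pandas up front that zip is text, so no inference happens for it.
df = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str})
print(df["zip"].tolist())  # ['04210', '02108', '60601']
```

The general rule: identifiers that merely look numeric (ZIP codes, phone numbers, account numbers) are usually safer stored as text.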

This is a real bug pattern in production analytics — the kind of thing that costs careers when a "data cleaning" step quietly corrupts the source data.

Memory and your laptop

A 1-million-row DataFrame with 10 numeric columns is about 80 MB in memory (8 bytes per float64 × 10 columns × 1M rows). A typical laptop has 8–32 GB of RAM, so this is small. But:

  • A 100M-row × 50-column dataset is ~40 GB — too big.
  • Strings (object dtype) take much more memory per value than numbers because each string is a separate Python object.
  • Lots of duplication in a string column can be collapsed by converting to category dtype.
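
The arithmetic above, and the effect of category, can be checked with a short sketch. The helper estimate_bytes is hypothetical and covers only all-float64 frames; the region labels are made up.

```python
import numpy as np
import pandas as pd

BYTES_PER_FLOAT64 = 8


def estimate_bytes(rows, numeric_columns):
    """Back-of-envelope size of an all-float64 DataFrame (index excluded)."""
    return rows * numeric_columns * BYTES_PER_FLOAT64


print(estimate_bytes(1_000_000, 10) / 1e6, "MB")    # 80.0 MB
print(estimate_bytes(100_000_000, 50) / 1e9, "GB")  # 40.0 GB

# Check the estimate against a real (smaller) frame.
small = pd.DataFrame(np.zeros((100_000, 10)))
print(small.memory_usage(index=False).sum())        # 8000000

# A repetitive text column before and after conversion to category.
regions = pd.Series(["East", "West", "North", "South"] * 25_000)
as_text_bytes = regions.memory_usage(index=False, deep=True)
as_category_bytes = regions.astype("category").memory_usage(index=False, deep=True)
print(as_text_bytes, as_category_bytes)
```

The exact byte counts for the text column vary by pandas version, but with only four distinct labels the category version should be far smaller.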

We will look at memory in detail later. For now, just know that Pandas lives entirely in RAM, which is both its biggest strength (instant access, no I/O) and its biggest limitation (your dataset must fit).

Check your understanding

Question (select one)

A CSV file on disk is, at the lowest level:

  • A spreadsheet
  • A binary database table
  • A sequence of bytes (typically interpreted as text) with rows separated by newlines and values separated by commas
  • A proprietary Pandas format

Question (select one)

Why might Pandas read a ZIP code column like 04210 as the integer 4210?

  • Pandas has a bug
  • The CSV file is corrupted
  • Pandas inferred the column as int64 (because all values look like integers) and integers cannot have leading zeros, so the 0 is dropped silently
  • ZIP codes are not real data

Question (select one)

Why does Pandas store data column-by-column rather than row-by-row?

  • It uses less disk space
  • It is required by Python
  • All values in a column share a type, which allows tight packing and fast operations; analytical queries also tend to scan a few columns of many rows
  • Pandas was invented after rows existed

Question (select one)

A column containing the values "yes", "no", "yes" is loaded by Pandas. What dtype is it most likely to get by default?

  • bool
  • int64
  • object (Python strings)
  • category
