Rise of Digital Datasets
How the internet, sensors, and cheap storage turned data from a scarce resource into a flood — and why analysts needed new tools to keep up.
The 1980s spreadsheet was designed for the world it was born into: small, slow, expensive computers, holding small datasets, typed in by hand. Within twenty years, every one of those assumptions had been quietly demolished.
The internet became a data firehose
In 1990 the world had about 300,000 internet hosts. In 2000 it had 100 million. In 2020 it had over 4 billion users producing content — clicks, posts, searches, purchases, GPS pings — every second. Every one of those interactions is a row in some dataset somewhere.
Three forces drove this growth:
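- Cheap storage. A megabyte of disk cost about $0.00001 in 2020. That is a ten-billion-fold drop over the history of disk storage, which implies a starting price on the order of $100,000 per megabyte. Suddenly there was no reason to throw data away.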
- Cheap sensors. A smartphone has a GPS, an accelerometer, a gyroscope, a magnetometer, two cameras, a microphone, and a barometer — all producing data constantly.
- Cheap collection. A line of JavaScript on a webpage can send a click event back to a server in milliseconds. You do not have to ask the user, fill in a form, or wait for a keypress.
What "big" actually means
Before going further, let us be honest about scale. When people say "big data," they usually mean one of three different things:
- Bigger than a spreadsheet. Hundreds of thousands to tens of millions of rows. Comfortable for Pandas on a laptop.
- Bigger than memory. Tens of millions to billions of rows. Requires chunking, on-disk formats (Parquet), or out-of-core tools (Dask, Polars, DuckDB).
- Bigger than one machine. Petabytes and up. Requires distributed systems (Spark, BigQuery, Snowflake).
This course lives almost entirely in the first zone, with a few visits to the second. The third zone is a different (and much more expensive) world. The good news: the concepts you learn in the first zone — DataFrames, group-bys, joins, missing data, reshape — carry over directly to all three.
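A quick back-of-envelope estimate helps place a dataset in one of these zones. The sketch below assumes about 8 bytes per value (true for 64-bit numbers, optimistic for text), a laptop with 16 GiB of RAM and 2 TiB of disk, and a 3x headroom factor for working copies. All of these numbers, and the helper names, are illustrative assumptions rather than rules.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def estimated_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Rough in-memory size: every cell counted as one fixed-width value."""
    return n_rows * n_cols * bytes_per_value


def size_zone(
    n_rows: int,
    n_cols: int,
    ram_bytes: int = 16 * GIB,   # assumed laptop RAM
    disk_bytes: int = 2 * TIB,   # assumed laptop disk
    headroom: int = 3,           # assumed room for intermediate copies
) -> str:
    """Place a table in one of the three zones described above."""
    size = estimated_bytes(n_rows, n_cols)
    if size * headroom <= ram_bytes:
        return "fits in memory"
    if size <= disk_bytes:
        return "bigger than memory"
    return "bigger than one machine"


print(size_zone(40_000, 12))          # a typical analyst extract
print(size_zone(2_000_000_000, 20))   # billions of rows
print(size_zone(10 ** 12, 50))        # far beyond one machine
```

The point of the estimate is not precision. It is a sanity check to run before reaching for heavier tools.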
Most analyst work is small data
A common misconception is that all modern data work is "big data." In practice, the great majority of business analysis fits comfortably on a laptop with Pandas. Even at large tech companies, most ad-hoc analyses operate on filtered, aggregated extracts that are small enough to load in memory. Big-data tools are usually used to produce the small dataset that the analyst actually explores.
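The pattern of using heavier machinery to produce a small extract can be imitated in miniature with Pandas itself. This sketch reads a CSV in chunks and keeps only per-region totals. The column names and the tiny inline file are made up for illustration; a real job would point read_csv at a large file and use a much bigger chunksize.

```python
import io

import pandas as pd

# Stand-in for a file too large to load at once (hypothetical columns).
raw = io.StringIO(
    "region,amount\n"
    "north,10\n"
    "south,5\n"
    "north,7\n"
    "east,3\n"
    "south,1\n"
)

# Aggregate each chunk separately so only small partial results stay in memory.
partials = []
for chunk in pd.read_csv(raw, chunksize=2):
    partials.append(chunk.groupby("region")["amount"].sum())

# Combine the partial sums into one small table the analyst can explore.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```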
Where digital datasets live
When you start analyzing real data, you will encounter it in many shapes. Five of the most common:
- CSV (comma-separated values). A plain text file with one row per line and columns separated by commas. Easy to read, easy to write, the universal currency of small data. We will use these a lot.
- Excel files (.xlsx). Spreadsheets, with all their power (multiple sheets, formulas, formatting) and all their pain (Excel's auto-conversions, hidden cells).
- JSON. Hierarchical, nested data, common in web APIs. Often has to be flattened into a tabular shape before analysis.
- Parquet. A columnar binary format optimized for analytical queries. Faster and smaller than CSV but not human-readable.
- SQL databases. Tables that live on a server. Pandas can query them and pull results into DataFrames.
Pandas's superpower is that, once data is inside a DataFrame, the rest of your analysis does not care where it came from. You might pull from five different sources and join them together in one notebook without changing your downstream code.
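To make that concrete, here is a small sketch in which the same made-up records arrive once as CSV text and once as nested JSON. The data, the column names, and the mean_salary helper are all hypothetical; the point is only that the downstream function does not change.

```python
import io
import json

import pandas as pd

csv_text = "name,city,salary\nAda,Boston,100\nLin,Austin,90\n"
json_text = (
    '[{"name": "Ada", "address": {"city": "Boston"}, "salary": 100},'
    ' {"name": "Lin", "address": {"city": "Austin"}, "salary": 90}]'
)

# CSV is already tabular.
from_csv = pd.read_csv(io.StringIO(csv_text))

# Nested JSON is flattened first; json_normalize names the nested
# field "address.city", so we rename it to match the CSV layout.
from_json = pd.json_normalize(json.loads(json_text)).rename(
    columns={"address.city": "city"}
)


def mean_salary(df: pd.DataFrame) -> float:
    """Downstream code: it has no idea where df came from."""
    return float(df["salary"].mean())


print(mean_salary(from_csv), mean_salary(from_json))
```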
What digital datasets are usually not
Real-world digital datasets are messy. As we go through the course you will see all of these, but it helps to have the vocabulary up front:
- Incomplete. Survey questions get skipped, sensors drop packets, users abandon forms. Missing values are everywhere.
- Inconsistent. "USA", "U.S.A.", "United States", and "Murica" might all mean the same country.
- Duplicated. The same event recorded by two pipelines, or a user double-clicking a button.
- Mislabeled. Columns named id_1, id_2, temp, x, value. What do they mean? Hope there is documentation.
- Wrongly typed. A "ZIP code" column where Excel helpfully stripped the leading zero from 04210.
- Biased. The data covers only people who answered the survey, only users who clicked the button, only patients who visited the clinic.
Most of an analyst's day is spent dealing with the items on this list, not running the eventual mean() at the end. We will devote multiple chapters to this work.
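As a small preview of that work, the sketch below runs into four of these problems in one tiny, invented table: a missing score, inconsistent country labels, a duplicated row, and a ZIP code that would lose its leading zero if read as a number. The column names and the label mapping are assumptions made for illustration.

```python
import io

import pandas as pd

raw = io.StringIO(
    "event_id,country,zip,score\n"
    "1,USA,04210,3.5\n"
    "2,U.S.A.,04210,\n"
    "2,U.S.A.,04210,\n"
    "3,United States,10001,4.0\n"
)

# Wrongly typed: read ZIP codes as text so "04210" keeps its leading zero.
df = pd.read_csv(raw, dtype={"zip": str})

# Incomplete: count the missing scores before deciding what to do with them.
missing_scores = int(df["score"].isna().sum())

# Inconsistent: map the spelling variants onto one label.
df["country"] = df["country"].replace(
    {"USA": "United States", "U.S.A.": "United States"}
)

# Duplicated: drop rows that are exact repeats.
clean = df.drop_duplicates().reset_index(drop=True)

print(missing_scores)
print(clean)
```

Bias is the one item on the list that no line of code can detect; it has to be reasoned about from how the data was collected.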
A first encounter with a real dataset
Let us actually load a real digital dataset — a public HR file with employee attributes from a fictional company — and look at its shape.
The result is a tabular dataset of about 15,000 employees and around a dozen columns. We have not analyzed anything yet; we just loaded it. But notice what has already changed compared to the spreadsheet era:
- We never opened a file dialog.
- The data lives on the public internet and came to us over HTTP.
- The shape and column names are programmatically inspectable.
- The next person who reads our code can re-run these two lines and get exactly the same dataset back.
This is the workflow we will use through the rest of the course.
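The loading step described above looks roughly like the sketch below. The course's actual file location is not reproduced here, so a few invented rows in an in-memory buffer stand in for it, and the column names are hypothetical. pd.read_csv accepts a URL string in the same position as the buffer, which is what makes the load-over-HTTP workflow a two-line affair.

```python
import io

import pandas as pd

# Stand-in for the remote HR file (hypothetical columns and rows).
# With a real source you would pass its URL string to pd.read_csv instead.
source = io.StringIO(
    "employee_id,department,tenure_years,left\n"
    "1,sales,3,0\n"
    "2,engineering,5,1\n"
    "3,sales,2,0\n"
)

df = pd.read_csv(source)

# Shape and column names are available programmatically.
print(df.shape)
print(list(df.columns))
```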
Mid-page check
Which of these is the most accurate definition of "big data" as discussed in this chapter?
Any dataset larger than 1,000 rows
"Big data" is context-dependent — usually it means a dataset that does not fit on one machine's memory or disk, requiring distributed systems
Any dataset that contains numbers
Any dataset that requires Pandas
Why this matters for your career
Almost every modern job that involves data — analyst, product manager, scientist, engineer, marketer — assumes you can:
1. Find the data (databases, APIs, files, dashboards).
2. Load it into a tool you can manipulate.
3. Clean it.
4. Slice and summarize it.
5. Visualize the result.
6. Write down what you did so others can reproduce it.
Pandas is currently the dominant tool for steps 2–5 in the Python world. Learning it well will let you move fluidly between roles and industries because the operations are the same whether the dataset is healthcare claims, ad impressions, sensor readings, or sales orders.
A short exercise
Pick a question — a real one — about the HR dataset above:
- What is the average tenure?
- Which department has the most employees?
- How many employees left the company?
Hold onto that question. By the end of the Aggregation & GroupBy chapter, you will be able to answer it in one line.
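As a preview of where this is heading, here is what those one-liners might look like on a tiny invented table. The column names (department, tenure_years, left) are assumptions, not the real dataset's schema, so treat this as the shape of the answer rather than the answer itself.

```python
import pandas as pd

# A four-row stand-in with hypothetical column names.
hr = pd.DataFrame(
    {
        "department": ["sales", "engineering", "sales", "support"],
        "tenure_years": [3, 5, 2, 6],
        "left": [0, 1, 0, 1],
    }
)

# What is the average tenure?
average_tenure = hr["tenure_years"].mean()

# Which department has the most employees?
largest_department = hr["department"].value_counts().idxmax()

# How many employees left the company? (1 means the employee left.)
number_who_left = int(hr["left"].sum())

print(average_tenure, largest_department, number_who_left)
```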
Check your understanding
Why did digital datasets explode in size starting around the year 2000?
Pandas was released and people had to fill it with data
Excel was upgraded
A combination of cheaper storage, ubiquitous sensors (especially smartphones), and effortless web-based collection (clickstreams, APIs)
Hard drives became more reliable
What does Pandas do once data is loaded into a DataFrame, regardless of whether the source was a CSV, Excel file, JSON, Parquet, or SQL?
It writes the data back to the source automatically
Downstream code can manipulate the DataFrame in the same way regardless of where it originally came from
It compresses the data
It encrypts the data
A junior analyst says "we have big data — about 40,000 rows." How should you (gently) respond?
They are correct
40,000 rows fits comfortably in memory on any laptop; Pandas will handle it instantly, and "big data" is normally reserved for datasets that no longer fit in memory or on one machine
They are correct, but only on Tuesdays