Spreadsheets vs Code
A side-by-side look at how the same analysis feels in Excel versus Pandas — and why analysts increasingly choose code even when the data is small.
You have already heard the high-level argument: spreadsheets are manual, code is reproducible. This chapter makes the comparison concrete with side-by-side scenarios so you can feel, in your bones, where each tool wins and where each one loses.
Two tools, one job
Pretend you are an analyst at a coffee company. Every Monday morning a CSV file lands in your inbox with last week's orders. You need to produce:
- Total revenue.
- Top-selling product.
- Revenue by region.
- A chart for the weekly team meeting.
Let us do this both ways and compare the feel.
The spreadsheet way
- Open Excel. Double-click the CSV.
- Excel asks how to import. Pick "general." Click through.
- Click the column headers to confirm types. Notice that the phone column got auto-converted to scientific notation (5.55E+11). Sigh, fix.
- Select the revenue column. Read the sum off the status bar. Copy it into a separate "Summary" sheet.
- Sort the table by product, eyeball the top one. Type it into the summary sheet.
- Sort by region. Use =SUMIF(B:B, "West", D:D) for each region. Copy values into summary sheet.
- Highlight summary cells, Insert > Chart, pick bar.
- Save as weekly_report_2024_03_18.xlsx. Email it.
- Next Monday: do it all again.
The code way
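import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("orders.csv")
print("Total revenue:", df["revenue"].sum())
print("Top product:", df.groupby("product")["revenue"].sum().idxmax())
# Compute revenue by region once, then print it and chart it
revenue_by_region = df.groupby("region")["revenue"].sum()
print(revenue_by_region)
revenue_by_region.plot.bar()
plt.show()  # needed to display the chart when run as a script
Save the file as weekly_report.py. Next Monday: re-run.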
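The type surprise from the spreadsheet walkthrough has a counterpart in code. Here is a minimal sketch, assuming the file has a phone column like the one in the Excel steps (the sample rows are made up). Declaring that column as text at load time keeps identifiers such as phone numbers from being parsed as numbers.

```python
import io

import pandas as pd

# Made-up sample standing in for orders.csv; the phone column is an
# assumption carried over from the spreadsheet walkthrough.
sample_csv = io.StringIO(
    "product,region,revenue,phone\n"
    "Espresso,West,12.50,0555123456\n"
    "Latte,East,8.00,0555987654\n"
)

# dtype=str for one column tells pandas to keep it as text instead of
# inferring a numeric type (which would drop the leading zero).
orders = pd.read_csv(sample_csv, dtype={"phone": str})

print(orders["phone"].tolist())
print(orders.dtypes)
```

The type decision is written down once in the script, so it applies the same way on every run.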
What the comparison actually proves
The "code" version takes more upfront learning. There is no pretending otherwise. But once the learning is paid for, every repeated run is essentially free — and the script is itself a record of how the analysis was done.
That is the central trade-off:
| | Spreadsheet | Code |
|---|---|---|
| First-time cost | Low | Higher |
| Cost of next run | Same as first run | Near zero |
| Self-documenting | No (only final state) | Yes (the source) |
| Audit trail | Weak | Strong (with git) |
| Handles 5 million rows | Painfully or not at all | Yes |
| Easy to share visually | Yes | Requires hosting |
| Easy for non-coders to edit | Yes | No |
There is no winner in the abstract — only winners for a given problem.
The 'recurring' test
A useful heuristic: if you will do this analysis more than twice, write it in code. If it is genuinely a one-off (a meeting prep, a quick sanity check), reach for the spreadsheet. You will not regret using code for recurring work, and you will not regret using a spreadsheet for true one-offs.
Five common spreadsheet pains, in Pandas
1. Accidental edits
In Excel, every interaction modifies the file. There is no
"diff" of what you changed; no git log of why. In Pandas, your
script is the file. Diffs are line-by-line. Reviewers can ask
"why did you switch from mean to median on line 42?"
2. Copy-paste from another sheet
In Excel, this is invisible after the fact. Was that column
copied from Q4_sales.xlsx or Q4_sales_FINAL_v2.xlsx? In
Pandas, the source filename or URL is right there in your
pd.read_csv(...) call.
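A minimal sketch of that idea follows. It uses a CSV stand-in named after the file mentioned above, with made-up contents, and the source_file column is a hypothetical addition rather than anything pandas does on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a small stand-in file so the sketch is self-contained.
# The file name echoes the text above; the contents are made up.
workdir = Path(tempfile.mkdtemp())
source_path = workdir / "Q4_sales_FINAL_v2.csv"
source_path.write_text("region,revenue\nWest,100\nEast,250\n")

# The path in this call is the record of where the data came from.
sales = pd.read_csv(source_path)

# Optional: stamp each row with its source so the origin survives
# later merges and concatenations.
sales["source_file"] = source_path.name

print(sales)
```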
3. The 65k-row wall and slowness
.xls was capped at 65,536 rows. .xlsx raised the limit to
1,048,576 rows, but workbooks approaching that size become slow.
Pandas handles tens of millions of rows on a laptop, and
millions interactively.
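As a rough illustration, the sketch below builds one million synthetic rows, well past the old .xls limit, and runs the same kind of groupby used on the weekly file. The region names, revenue range, and row count are made up, and actual speed depends on your machine.

```python
import numpy as np
import pandas as pd

# Synthetic data: one million rows, far beyond the .xls cap of 65,536.
# Region names and the revenue range are made up for illustration.
n_rows = 1_000_000
rng = np.random.default_rng(seed=0)
big = pd.DataFrame(
    {
        "region": rng.choice(["West", "East", "North", "South"], size=n_rows),
        "revenue": rng.uniform(1.0, 50.0, size=n_rows),
    }
)

# The same groupby used on a small weekly file works unchanged here.
revenue_by_region = big.groupby("region")["revenue"].sum()

print(len(big), "rows")
print(revenue_by_region)
```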
4. Manually-tracked transformations
In a spreadsheet, "I filtered out test users, dropped Saturday, and converted everything to USD" lives in someone's head. In Pandas:
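A minimal sketch of those three steps, assuming hypothetical column names (is_test_user, order_date, amount, currency) and made-up exchange rates:

```python
import pandas as pd

# Made-up sample data; column names and exchange rates are assumptions.
orders = pd.DataFrame(
    {
        "is_test_user": [False, True, False, False],
        "order_date": pd.to_datetime(
            ["2024-03-15", "2024-03-15", "2024-03-16", "2024-03-18"]
        ),
        "amount": [10.0, 99.0, 20.0, 8.0],
        "currency": ["USD", "USD", "EUR", "GBP"],
    }
)
usd_per_unit = {"USD": 1.0, "EUR": 1.1, "GBP": 1.25}  # illustrative rates only

# 1. Filter out test users.
clean = orders[~orders["is_test_user"]]

# 2. Drop Saturday orders (2024-03-16 is a Saturday).
clean = clean[clean["order_date"].dt.day_name() != "Saturday"]

# 3. Convert every amount to USD.
clean = clean.assign(
    amount_usd=clean["amount"] * clean["currency"].map(usd_per_unit)
)

print(clean)
```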
Anyone reading this code knows exactly what you did. There is no oral history.
5. "Refresh the report"
In Excel, a refresh means an analyst manually re-doing the
work, or building a fragile chain of Power Query /
VBA / OFFSET formulas that one person on the team understands.
In Pandas, refreshing is python report.py — or scheduling that
command to run automatically every morning.
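A refreshable report can be as small as the sketch below. The file name report.py comes from the text above; the build_report function, its columns, and the sample rows are hypothetical.

```python
import io

import pandas as pd


def build_report(source) -> pd.Series:
    """Load the weekly orders and return revenue by region."""
    df = pd.read_csv(source)
    return df.groupby("region")["revenue"].sum()


if __name__ == "__main__":
    # In real use this would be a path such as "orders.csv"; an in-memory
    # sample keeps the sketch runnable on its own.
    sample = io.StringIO("region,revenue\nWest,10\nEast,5\nWest,7\n")
    print(build_report(sample))
```

Because the whole analysis sits behind one function, any scheduler that can run a command (cron or Windows Task Scheduler, for example) can refresh the report without a person in the loop.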
Where spreadsheets still win
It is important not to overstate the case. Pandas can be overkill for:
- Quick exploratory totals on a tiny file. Excel is faster for "how much did I spend at the grocery store last month?"
- Sharing a working model with a non-technical stakeholder. The CFO wants to play with the model — change assumptions, see results. A spreadsheet is the right vehicle.
- Visual layout. Excel is also a layout tool — borders, shading, merged cells, embedded charts. For documents whose appearance matters, it is still hard to beat.
A mature analyst uses both. The pipeline often looks like: raw data → Pandas (clean, summarize, validate) → Excel (final presentation for humans).
A hybrid example
Pandas can write .xlsx files directly. So you can do the
intelligence in code and the presentation in Excel — best of
both worlds. (We will return to this in the Exporting Cleaned
Data chapter.)
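Here is a minimal sketch of the hybrid approach, using made-up sample data and an in-memory buffer instead of a file on disk. DataFrame.to_excel depends on an optional Excel engine such as openpyxl, so the sketch skips the export step if none is installed.

```python
import io

import pandas as pd

# Made-up sample data standing in for the weekly orders file.
orders = pd.DataFrame(
    {
        "region": ["West", "East", "West"],
        "product": ["Latte", "Espresso", "Latte"],
        "revenue": [10.0, 5.0, 7.0],
    }
)

# The analysis happens in pandas...
summary = orders.groupby("region", as_index=False)["revenue"].sum()

# ...and the delivery format is Excel. Writing to an in-memory buffer
# keeps the sketch free of side effects.
buffer = io.BytesIO()
try:
    summary.to_excel(buffer, sheet_name="Summary", index=False)
    print(f"Wrote {buffer.getbuffer().nbytes} bytes of .xlsx data")
except ImportError:
    # to_excel relies on an optional Excel engine such as openpyxl.
    print("No Excel writer engine installed; skipping the export step")
```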
In a real workflow you would write to a real .xlsx file and
email it (or upload it to a shared drive). The analysis lives
in Python so it is reproducible; the delivery lives in Excel so
the audience can read it.
Conceptual comparison: manual vs reproducible
The reproducible pipeline (same inputs, same outputs, audit trail) is the whole game. Pandas is the most popular tool for building it in the Python world.
What to take away
- This course is not anti-spreadsheet. It is pro-reproducibility.
- The single biggest reason analysts move from spreadsheets to code is not speed; it is the ability to re-run an analysis reliably.
- The second biggest reason is transparency: code is the record of the analysis. Spreadsheets are only the result.
- The third is scale: code does not care whether the dataset has fifty rows or fifty million.
Check your understanding
Which of these is the chapter's stated biggest practical reason analysts move from spreadsheets to code?
- Code is shorter
- Code is more fun
- Code-based analyses are reproducible: re-running on next week's data gives the same results without manual effort
- Code uses less memory
In the chapter's "recurring test," what is the heuristic for choosing between a spreadsheet and Pandas?
- If you will do this analysis more than twice, write it in code
- If the dataset has more than 50 rows, use Pandas
- If the analysis has more than 3 steps, use Excel
- If you have meetings on Mondays, use Pandas
Which of these is still a legitimate strength of spreadsheets versus Pandas?
- Handling millions of rows
- Reproducibility
- Version control via git
- Letting a non-technical stakeholder play with assumptions in a visual model