Exporting Cleaned Data
Saving your work — CSV, Excel, Parquet — and the choices that decide whether your colleagues will love you tomorrow.
After all the loading, cleaning, and transforming, you need to hand off the result. Pandas can write to dozens of formats — this page covers the ones that matter most.
CSV — the universal lowest common denominator
index=False is almost always what you want — otherwise you
get a mystery unnamed column when someone re-reads the file.
CSV gotchas
- Text fields containing commas need quoting (Pandas handles this automatically).
- Dates become strings — must re-parse on load.
- Dtypes are guessed on re-read, which can change them.
- Very large files become slow.
Excel — the format your stakeholders want
Multiple sheets in one file:
Excel preserves types better than CSV but the files are bigger and the format is messier under the hood.
Parquet — the analyst's favorite
Parquet is:
- Columnar — fast for analytics queries.
- Typed — datetimes stay datetimes; categories stay categories.
- Compressed — files are often 10× smaller than CSV.
- Standard — readable by Pandas, Spark, DuckDB, R, etc.
The only downside: non-technical users can't open a .parquet
in Excel without help.
The big trade-off
| Format | Human-readable | Preserves types | Size | Speed |
|---|---|---|---|---|
| CSV | ✅ | ❌ | Large | Slow on big data |
| Excel | ⚠️ (via Excel) | ⚠️ | Larger | Slow |
| Parquet | ❌ | ✅ | Small | Fast |
| JSON | ✅ | ⚠️ | Large | Slow |
| Pickle | ❌ | ✅ (Pandas-only) | Medium | Fast — but unsafe across versions |
A common pattern: Parquet for internal handoffs; CSV/Excel for sharing with humans.
Writing checklist
Before you df.to_csv(...):
Date-stamped filenames
A small habit with huge dividends:
sales_clean_2024-03-18.csv is much friendlier than
sales_clean (3).csv.
Round-tripping — verify your export
Before declaring "done", read the file back and confirm:
You'd be amazed how often the answer to "did the export work?" is "...no, the dtypes changed."
Handling categorical and missing data on export
- CSV writes NaN as an empty string by default; control
with
na_rep="NA"if needed. - CSV has no concept of "category" type — it becomes a plain string column on re-read.
- Parquet preserves both NaN and category type, which is why it's preferred for intermediate storage.
Mini challenge
Given the raw DataFrame df (provided), produce a cleaned DataFrame called clean with:
- All column names lower-cased.
- A column
full_namemade by combiningfirstandlastwith a space, then dropping the original two columns. - Rows where
ageis missing dropped. - The result written to CSV at
/tmp/clean.csvwith no index. - A round-trip read into a variable
backthat should equalcleanin shape and columns.
Check your understanding
Why pass index=False when calling to_csv?
It is faster
It hides PII
Otherwise the DataFrame's index becomes a mystery first column ("Unnamed: 0") when someone reads the file back
It is required
Which format would you choose for an intermediate file that another Pandas script will load tomorrow?
CSV
Excel
Parquet — preserves dtypes (datetimes, categories), is smaller, and reads faster
JSON
A stakeholder asks for the cleaned dataset to drop into their pivot table. Best format?
Parquet
Pickle
Excel (.xlsx) or CSV — formats they can open without code
HDF5