Exporting Cleaned Data

Saving your work — CSV, Excel, Parquet — and the choices that decide whether your colleagues will love you tomorrow.

After all the loading, cleaning, and transforming, you need to hand off the result. Pandas can write to dozens of formats — this page covers the ones that matter most.

CSV — the universal lowest common denominator

index=False is almost always what you want — otherwise you get a mystery unnamed column when someone re-reads the file.

CSV gotchas

Text fields containing commas need quoting (Pandas handles this automatically).
Dates become strings — must re-parse on load.
Dtypes are guessed on re-read, which can change them.
Very large files become slow.

Excel — the format your stakeholders want

Multiple sheets in one file:

Excel preserves types better than CSV but the files are bigger and the format is messier under the hood.

Parquet — the analyst's favorite

Parquet is:

Columnar — fast for analytics queries.
Typed — datetimes stay datetimes; categories stay categories.
Compressed — files are often 10× smaller than CSV.
Standard — readable by Pandas, Spark, DuckDB, R, etc.

The only downside: non-technical users can't open a .parquet in Excel without help.

The big trade-off

Format	Human-readable	Preserves types	Size	Speed
CSV	✅	❌	Large	Slow on big data
Excel	⚠️ (via Excel)	⚠️	Larger	Slow
Parquet	❌	✅	Small	Fast
JSON	✅	⚠️	Large	Slow
Pickle	❌	✅ (Pandas-only)	Medium	Fast — but unsafe across versions

A common pattern: Parquet for internal handoffs; CSV/Excel for sharing with humans.

Writing checklist

Before you df.to_csv(...):

Date-stamped filenames

A small habit with huge dividends:

sales_clean_2024-03-18.csv is much friendlier than sales_clean (3).csv.

Round-tripping — verify your export

Before declaring "done", read the file back and confirm:

You'd be amazed how often the answer to "did the export work?" is "...no, the dtypes changed."

Handling categorical and missing data on export

CSV writes NaN as an empty string by default; control with na_rep="NA" if needed.
CSV has no concept of "category" type — it becomes a plain string column on re-read.
Parquet preserves both NaN and category type, which is why it's preferred for intermediate storage.

Mini challenge

Given the raw DataFrame df (provided), produce a cleaned DataFrame called clean with:

All column names lower-cased.
A column full_name made by combining first and last with a space, then dropping the original two columns.
Rows where age is missing dropped.
The result written to CSV at /tmp/clean.csv with no index.
A round-trip read into a variable back that should equal clean in shape and columns.

Check your understanding

QuestionSelect one

Why pass index=False when calling to_csv?

It is faster

It hides PII

Otherwise the DataFrame's index becomes a mystery first column ("Unnamed: 0") when someone reads the file back

It is required

QuestionSelect one

Which format would you choose for an intermediate file that another Pandas script will load tomorrow?

CSV

Excel

Parquet — preserves dtypes (datetimes, categories), is smaller, and reads faster

JSON

QuestionSelect one

A stakeholder asks for the cleaned dataset to drop into their pivot table. Best format?

Parquet

Pickle

Excel (.xlsx) or CSV — formats they can open without code

HDF5

Debugging Analysis Code

Why analysts spend half their time debugging — and the systematic habits that make it less painful.

Project Organization

How to lay out files, folders, and notebooks so an analysis stays understandable months after you wrote it.

CSV — the universal lowest common denominator CSV gotchas Excel — the format your stakeholders want Parquet — the analyst's favorite The big trade-off Writing checklist Date-stamped filenames Round-tripping — verify your export Handling categorical and missing data on export Mini challenge Check your understanding

Exporting Cleaned Data

On this page