Dataslope logoDataslope

Exporting Cleaned Data

Saving your work — CSV, Excel, Parquet — and the choices that decide whether your colleagues will love you tomorrow.

After all the loading, cleaning, and transforming, you need to hand off the result. Pandas can write to dozens of formats — this page covers the ones that matter most.

CSV — the universal lowest common denominator

Code Block
Python 3.13.2

index=False is almost always what you want — otherwise you get a mystery unnamed column when someone re-reads the file.

CSV gotchas

  • Text fields containing commas need quoting (Pandas handles this automatically).
  • Dates become strings — must re-parse on load.
  • Dtypes are guessed on re-read, which can change them.
  • Very large files become slow.

Excel — the format your stakeholders want

Code Block
Python 3.13.2

Multiple sheets in one file:

Code Block
Python 3.13.2

Excel preserves types better than CSV but the files are bigger and the format is messier under the hood.

Parquet — the analyst's favorite

Code Block
Python 3.13.2

Parquet is:

  • Columnar — fast for analytics queries.
  • Typed — datetimes stay datetimes; categories stay categories.
  • Compressed — files are often 10× smaller than CSV.
  • Standard — readable by Pandas, Spark, DuckDB, R, etc.

The only downside: non-technical users can't open a .parquet in Excel without help.

The big trade-off

FormatHuman-readablePreserves typesSizeSpeed
CSVLargeSlow on big data
Excel⚠️ (via Excel)⚠️LargerSlow
ParquetSmallFast
JSON⚠️LargeSlow
Pickle✅ (Pandas-only)MediumFast — but unsafe across versions

A common pattern: Parquet for internal handoffs; CSV/Excel for sharing with humans.

Writing checklist

Before you df.to_csv(...):

Date-stamped filenames

A small habit with huge dividends:

Code Block
Python 3.13.2

sales_clean_2024-03-18.csv is much friendlier than sales_clean (3).csv.

Round-tripping — verify your export

Before declaring "done", read the file back and confirm:

Code Block
Python 3.13.2

You'd be amazed how often the answer to "did the export work?" is "...no, the dtypes changed."

Handling categorical and missing data on export

  • CSV writes NaN as an empty string by default; control with na_rep="NA" if needed.
  • CSV has no concept of "category" type — it becomes a plain string column on re-read.
  • Parquet preserves both NaN and category type, which is why it's preferred for intermediate storage.

Mini challenge

Challenge
Python 3.13.2
Clean, export, and verify

Given the raw DataFrame df (provided), produce a cleaned DataFrame called clean with:

  • All column names lower-cased.
  • A column full_name made by combining first and last with a space, then dropping the original two columns.
  • Rows where age is missing dropped.
  • The result written to CSV at /tmp/clean.csv with no index.
  • A round-trip read into a variable back that should equal clean in shape and columns.

Check your understanding

QuestionSelect one

Why pass index=False when calling to_csv?

It is faster

It hides PII

Otherwise the DataFrame's index becomes a mystery first column ("Unnamed: 0") when someone reads the file back

It is required

QuestionSelect one

Which format would you choose for an intermediate file that another Pandas script will load tomorrow?

CSV

Excel

Parquet — preserves dtypes (datetimes, categories), is smaller, and reads faster

JSON

QuestionSelect one

A stakeholder asks for the cleaned dataset to drop into their pivot table. Best format?

Parquet

Pickle

Excel (.xlsx) or CSV — formats they can open without code

HDF5

On this page