Creating New Columns

Computed columns, conditional columns, mapped columns, and the in-place vs. assign trade-off.

Most analyses involve creating new columns: ratios, flags, buckets, formatted strings. Pandas gives you several ways. Each fits a different situation.

The simplest case — arithmetic

The whole column appears on both sides of =. That is the vectorized style: no for-loop, just whole-column arithmetic.
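As a minimal sketch of that pattern, the snippet below builds a small made-up DataFrame (the `salary` and `bonus` values are illustrative only) and derives two columns with plain arithmetic.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the pattern.
df = pd.DataFrame({
    "salary": [50000, 62000, 71000],
    "bonus": [5000, 3100, 0],
})

# Whole-column arithmetic: each expression operates on every row at once.
df["total_comp"] = df["salary"] + df["bonus"]
df["bonus_ratio"] = df["bonus"] / df["salary"]

print(df)
```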

`.assign()` — chainable column creation

.assign() produces a new DataFrame with the added columns, which makes it perfect for method chains.

The lambda x: ... lets .assign see the current state of the DataFrame (including columns added earlier in the chain). This matters when you want one new column to reference another.
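A short sketch of such a chain, using made-up data: the second `.assign` reads `total_comp`, which exists only because the first step in the chain created it.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the pattern.
df = pd.DataFrame({"salary": [50000, 62000], "bonus": [5000, 3100]})

result = (
    df
    # x is the DataFrame as it stands at this point in the chain.
    .assign(total_comp=lambda x: x["salary"] + x["bonus"])
    # total_comp is visible here because the previous step added it.
    .assign(bonus_share=lambda x: x["bonus"] / x["total_comp"])
)

print(result)
print(df.columns.tolist())  # the original df has no new columns
```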

Mutate-in-place vs. assign

The same logical operation can be expressed two ways:

# In place — modifies df
df["total_comp"] = df["salary"] + df["bonus"]

# Functional — returns a new DataFrame
df2 = df.assign(total_comp=df["salary"] + df["bonus"])

Both are widely used. A modern convention: use .assign() inside long method chains; use direct assignment for one-off tweaks. Either way, immutability is the analyst's friend — keeping the original df untouched makes it easier to reason about later cells.

Mutable vs immutable thinking

Mutable operations change the DataFrame in place; immutable operations return a new one. Pandas supports both. Beginners tend to overuse mutable operations (and lose track of state); experienced analysts often prefer immutable ones, even at the cost of an extra variable.

Conditional columns with `np.where`

np.where(condition, value_if_true, value_if_false) is a vectorized ternary.
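For instance, a two-way flag might look like the sketch below. The data and the 100,000 threshold are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only to illustrate the pattern.
df = pd.DataFrame({"salary": [48000, 75000, 120000]})

# One condition, two possible outcomes, evaluated for every row at once.
df["high_earner"] = np.where(df["salary"] >= 100000, "yes", "no")

print(df)
```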

Multi-condition: `np.select`

For more than two branches, np.select is cleaner than nested np.wheres.
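A sketch with three branches, on made-up data with illustrative thresholds. Conditions are checked in order and the first one that matches wins, so the second condition only applies to rows that failed the first.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only to illustrate the pattern.
df = pd.DataFrame({"salary": [48000, 75000, 120000]})

conditions = [
    df["salary"] < 50000,    # checked first
    df["salary"] < 100000,   # only reached by rows that failed the first test
]
choices = ["low", "mid"]

# Rows matching no condition get the default value.
df["band"] = np.select(conditions, choices, default="high")

print(df)
```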

Mapping with dictionaries: `.map`

When the new column is a lookup from old values:

.map(dict) is the quintessential dictionary-as-lookup pattern. Values not in the dict become NaN.
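A small sketch of the lookup, on made-up amounts. The `rates` dict deliberately has no entry for "GBP", so that row comes back as NaN.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the pattern.
df = pd.DataFrame({
    "currency": ["USD", "EUR", "GBP"],
    "amount": [100.0, 200.0, 50.0],
})

# Illustrative lookup table; "GBP" is intentionally missing.
rates = {"USD": 1.00, "EUR": 1.08}

df["rate"] = df["currency"].map(rates)        # GBP -> NaN
df["amount_usd"] = df["amount"] * df["rate"]  # NaN propagates through arithmetic

print(df)
```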

`.apply` — when you really need a function

When a column's value depends on more complex logic, fall back to .apply. It is slower than vectorized operations, but flexible.

For one column → use Series.apply(func). For multiple columns per row → use DataFrame.apply(func, axis=1), but expect it to be slow on big data; prefer vectorized expressions when you can.
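Both forms are sketched below on made-up data. The helper functions `initials` and `comp_note` are hypothetical, written only for this illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the pattern.
df = pd.DataFrame({
    "name": ["ada lovelace", "alan turing"],
    "salary": [90000, 60000],
    "bonus": [0, 6000],
})

def initials(full_name):
    """Return the upper-cased first letter of each word."""
    return "".join(part[0].upper() for part in full_name.split())

def comp_note(row):
    """Describe the bonus using two columns from the same row."""
    if row["bonus"] == 0:
        return "salary only"
    return f"bonus is {row['bonus'] / row['salary']:.0%} of salary"

# One column in, one value out per element.
df["initials"] = df["name"].apply(initials)

# Whole row in, one value out per row (slow on large frames).
df["note"] = df.apply(comp_note, axis=1)

print(df)
```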

Binning continuous values

pd.cut slices a continuous column into named buckets:
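A minimal sketch, with bin edges and labels chosen purely for illustration. By default each bin includes its right edge, so an age of exactly 30 falls in the first bucket.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the pattern.
df = pd.DataFrame({"age": [22, 35, 47, 61]})

# Bins are (0, 30], (30, 50], (50, 120]; there is one label per bin.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 50, 120],
    labels=["30 and under", "31-50", "over 50"],
)

print(df)
```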

For equal-frequency buckets (quartiles, quintiles) use pd.qcut:
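A sketch with quartiles on eight made-up salaries. Because the bin edges come from the data's own quantiles, each bucket here ends up with the same number of rows.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the pattern.
df = pd.DataFrame({
    "salary": [30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000],
})

# q=4 asks for four buckets holding (roughly) equal numbers of rows.
df["quartile"] = pd.qcut(df["salary"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(df["quartile"].value_counts().sort_index())
```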

A composite example

Put several techniques together — an HR dataset with derived fields.
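One possible version is sketched below. The `hr` table, its column names, the thresholds, and the `dept_names` lookup are all invented for illustration; the point is how the techniques combine in a single chain that leaves the original frame untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical HR data, used only to illustrate the pattern.
hr = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Dee"],
    "dept_code": ["ENG", "OPS", "ENG", "HR"],
    "salary": [120000, 48000, 75000, 52000],
    "bonus": [12000, 0, 7500, 2600],
    "age": [45, 23, 31, 58],
})

# Illustrative lookup table for department names.
dept_names = {"ENG": "Engineering", "OPS": "Operations", "HR": "Human Resources"}

enriched = (
    hr
    # Arithmetic: a computed column.
    .assign(total_comp=lambda x: x["salary"] + x["bonus"])
    .assign(
        # Two-way flag.
        has_bonus=lambda x: np.where(x["bonus"] > 0, "yes", "no"),
        # Three-way band; the first matching condition wins.
        band=lambda x: np.select(
            [x["total_comp"] < 50000, x["total_comp"] < 100000],
            ["low", "mid"],
            default="high",
        ),
        # Dictionary lookup.
        department=lambda x: x["dept_code"].map(dept_names),
        # Binning a continuous value.
        age_group=lambda x: pd.cut(
            x["age"],
            bins=[0, 30, 50, 120],
            labels=["30 and under", "31-50", "over 50"],
        ),
    )
)

print(enriched)
```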

Check your understanding

Question (select one)

What is the difference between df["x"] = ... and df.assign(x=...)?

Nothing

The first is faster

Direct assignment modifies df in place; .assign returns a new DataFrame and leaves the original untouched

.assign is a method on Series

Question (select one)

When would you reach for np.select instead of np.where?

When you want to filter

When you want to drop columns

When you need to assign one of more than two values depending on multiple conditions

When you want to sort

Question (select one)

df["currency"].map({"USD": 1.00, "EUR": 1.08}) returns NaN for rows whose currency is "GBP". Why?

Pandas does not support strings in maps

The map is corrupted

.map returns NaN for any value not found in the mapping dict; "GBP" is missing so the result is NaN

The dict is read-only

