Creating New Columns

Computed columns, conditional columns, mapped columns, and the in-place vs. assign trade-off.

Most analyses involve creating new columns: ratios, flags, buckets, formatted strings. Pandas gives you several ways to do this, and each fits a different situation.

The simplest case — arithmetic
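
As a minimal sketch of column arithmetic, assuming a small made-up table (the names `salary`, `bonus`, `total_comp`, and `bonus_pct` are illustrative, not taken from a real dataset):

```python
import pandas as pd

# Hypothetical data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe"],
    "salary": [90_000, 72_000, 110_000],
    "bonus": [9_000, 3_600, 16_500],
})

# Whole-column arithmetic: the operation is applied element by element.
df["total_comp"] = df["salary"] + df["bonus"]
df["bonus_pct"] = df["bonus"] / df["salary"] * 100

print(df)
```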

The whole column appears on both sides of =. That is the vectorized style: no for-loop, just whole-column arithmetic.

.assign() — chainable column creation

.assign() produces a new DataFrame with the added columns, which makes it perfect for method chains.
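
A short sketch of a chain built with .assign(), assuming hypothetical `salary` and `bonus` columns. The second step refers to a column that only exists because of the first step:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "salary": [90_000, 72_000, 110_000],
    "bonus": [9_000, 3_600, 16_500],
})

result = (
    df
    .assign(total_comp=lambda x: x["salary"] + x["bonus"])
    # This lambda receives the DataFrame with total_comp already added.
    .assign(bonus_share=lambda x: x["bonus"] / x["total_comp"])
)

print(result)
print(df.columns.tolist())  # the original df still has only its two columns
```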

The lambda x: ... lets .assign see the current state of the DataFrame (including columns added earlier in the chain). This matters when you want one new column to reference another.

Mutate-in-place vs. assign

The same logical operation can be expressed two ways:

# In place — modifies df
df["total_comp"] = df["salary"] + df["bonus"]

# Functional — returns a new DataFrame
df2 = df.assign(total_comp=df["salary"] + df["bonus"])

Both are widely used. A common convention: use .assign() inside long method chains and direct assignment for one-off tweaks. Either way, immutability is the analyst's friend: keeping the original df untouched makes later cells easier to reason about.

Mutable vs immutable thinking

Mutable operations change the DataFrame in place; immutable operations return a new one. Pandas supports both. Beginners tend to overuse mutable operations (and lose track of state); experienced analysts often prefer immutable ones, even at the cost of an extra variable.

Conditional columns with np.where

np.where(condition, value_if_true, value_if_false) is a vectorized ternary.
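
A minimal sketch of a two-way flag, assuming a hypothetical `salary` column and an arbitrary 80,000 threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"salary": [90_000, 72_000, 110_000]})

# One condition, two outcomes, evaluated for every row at once.
# The threshold is an arbitrary value chosen for the example.
df["band"] = np.where(df["salary"] >= 80_000, "high", "standard")

print(df)
```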

Multi-condition: np.select

For more than two branches, np.select is cleaner than nested np.where calls.
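
A sketch with three outcomes, assuming a hypothetical `salary` column and made-up cut-offs. Conditions are checked in order and the first match wins, so the stricter test goes first:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"salary": [90_000, 72_000, 110_000]})

# One boolean Series per branch, paired position by position with a choice.
conditions = [
    df["salary"] >= 100_000,
    df["salary"] >= 80_000,
]
choices = ["senior", "mid"]

# Rows matching no condition get the default. It is given as a string here
# so that it has the same type as the choices.
df["level"] = np.select(conditions, choices, default="junior")

print(df)
```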

Mapping with dictionaries: .map

When the new column is a lookup from old values:
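
A minimal sketch, assuming hypothetical department codes. The dict deliberately leaves out one code to show what happens to unmatched values:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"dept": ["ENG", "MKT", "ENG", "OPS"]})

# Lookup table from code to display name; "OPS" is intentionally missing.
dept_names = {"ENG": "Engineering", "MKT": "Marketing"}

df["dept_name"] = df["dept"].map(dept_names)

print(df)  # the OPS row has a missing value in dept_name
```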

.map(dict) is the quintessential dictionary-as-lookup pattern. Values not in the dict become NaN.

.apply — when you really need a function

When a column's value depends on more complex logic, fall back to .apply. It is slower than vectorized operations, but flexible.
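
A sketch of both forms of .apply, assuming made-up data and two hypothetical helper functions, `initials` and `describe`:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["ana lopez", "ben ortiz", "chloe zhang"],
    "years": [6, 1, 12],
})

def initials(full_name: str) -> str:
    """Return the upper-cased first letter of each word in the name."""
    return "".join(part[0].upper() for part in full_name.split())

# Series.apply: the function receives one value at a time.
df["initials"] = df["name"].apply(initials)

def describe(row: pd.Series) -> str:
    """Build a label from two columns of a single row."""
    tenure = "veteran" if row["years"] >= 5 else "newer"
    return f"{row['initials']}: {tenure}"

# DataFrame.apply with axis=1: the function receives one row at a time.
df["summary"] = df.apply(describe, axis=1)

print(df)
```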

For one column → use Series.apply(func). For multiple columns per row → use DataFrame.apply(func, axis=1), but expect it to be slow on big data; prefer vectorized expressions when you can.
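
To make the last point concrete, here is a small sketch (hypothetical columns again) in which a row-wise .apply and a vectorized expression produce the same values. On a table this small the speed difference is not noticeable; the gap tends to grow with the number of rows.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "salary": [90_000, 72_000, 110_000],
    "bonus": [9_000, 3_600, 16_500],
})

# Row-wise: the Python function is called once per row.
row_wise = df.apply(lambda row: row["salary"] + row["bonus"], axis=1)

# Vectorized: a single whole-column operation.
vectorized = df["salary"] + df["bonus"]

print((row_wise == vectorized).all())
```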

Binning continuous values

pd.cut slices a continuous column into named buckets:
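
A minimal sketch, assuming a hypothetical `age` column and arbitrary bin edges. By default each bin includes its right edge, so 30 would fall in the first bucket:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"age": [23, 35, 47, 61]})

# Four bins need five edges and four labels.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 45, 60, 120],
    labels=["30 and under", "31-45", "46-60", "over 60"],
)

print(df)
```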

For equal-frequency buckets (quartiles, quintiles) use pd.qcut:
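
A sketch of quartiles with pd.qcut, assuming eight made-up, distinct salaries so that each bucket ends up with the same number of rows:

```python
import pandas as pd

# Hypothetical data for illustration: eight distinct values.
df = pd.DataFrame({
    "salary": [48_000, 55_000, 61_000, 72_000,
               80_000, 95_000, 110_000, 150_000],
})

# q=4 asks for quartiles; the bin edges are computed from the data itself.
df["salary_quartile"] = pd.qcut(
    df["salary"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

print(df["salary_quartile"].value_counts().sort_index())
```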

A composite example

Put several techniques together — an HR dataset with derived fields.
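
One way this could look, as a sketch rather than a reference solution: the `hr` table, its column names, and every threshold below are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical HR data for illustration.
hr = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe", "Dev"],
    "dept": ["ENG", "MKT", "ENG", "OPS"],
    "salary": [90_000, 72_000, 110_000, 65_000],
    "bonus": [9_000, 3_600, 16_500, 0],
    "years": [6, 1, 12, 3],
})

dept_names = {"ENG": "Engineering", "MKT": "Marketing", "OPS": "Operations"}

enriched = hr.assign(
    # Arithmetic on whole columns.
    total_comp=lambda x: x["salary"] + x["bonus"],
    # Dictionary lookup.
    dept_name=lambda x: x["dept"].map(dept_names),
    # Two-way flag.
    has_bonus=lambda x: np.where(x["bonus"] > 0, "yes", "no"),
    # Three-way label; conditions are checked in order.
    level=lambda x: np.select(
        [x["years"] >= 10, x["years"] >= 5],
        ["senior", "mid"],
        default="junior",
    ),
    # Binning a column created earlier in this same .assign call.
    comp_band=lambda x: pd.cut(
        x["total_comp"],
        bins=[0, 80_000, 100_000, float("inf")],
        labels=["low", "medium", "high"],
    ),
)

print(enriched)
```

Because everything goes through .assign, `hr` itself is left unchanged and `enriched` holds the derived fields.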

Check your understanding

Question (select one)

What is the difference between df["x"] = ... and df.assign(x=...)?

Nothing

The first is faster

Direct assignment modifies df in place; .assign returns a new DataFrame and leaves the original untouched

.assign is a method on Series

Question (select one)

When would you reach for np.select instead of np.where?

When you want to filter

When you want to drop columns

When you need to assign one of more than two values depending on multiple conditions

When you want to sort

Question (select one)

df["currency"].map({"USD": 1.00, "EUR": 1.08}) returns NaN for rows whose currency is "GBP". Why?

Pandas does not support strings in maps

The map is corrupted

.map returns NaN for any value not found in the mapping dict; "GBP" is missing so the result is NaN

The dict is read-only
