Creating New Columns
Computed columns, conditional columns, mapped columns, and the in-place vs. assign trade-off.
Most analyses involve creating new columns: ratios, flags, buckets, formatted strings. Pandas gives you several ways. Each fits a different situation.
The simplest case — arithmetic
To derive a column arithmetically, put whole columns on both sides
of =: the new column on the left, existing columns on the right.
That is the vectorized style: no for-loop, just whole-column arithmetic.
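A minimal sketch of whole-column arithmetic. The DataFrame and its column names (salary, bonus) are made-up sample data, chosen to match the columns used later in this lesson.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({
    "salary": [50000, 62000, 48000],
    "bonus": [5000, 3100, 0],
})

# A sum of two columns, computed for every row at once.
df["total_comp"] = df["salary"] + df["bonus"]

# A ratio: bonus as a percentage of salary.
df["bonus_pct"] = df["bonus"] / df["salary"] * 100

print(df)
```

Each right-hand side is evaluated for all rows in one step, and the result is stored under the new column name.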
.assign() — chainable column creation
.assign() produces a new DataFrame with the added columns,
which makes it perfect for method chains.
The lambda x: ... lets .assign see the current state of the
DataFrame (including columns added earlier in the chain). This
matters when you want one new column to reference another.
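As a sketch of that chaining behavior (the sample data and the bonus_share column are assumptions made for illustration), the second .assign below can only work through a lambda, because total_comp does not exist on the original df:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"salary": [50000, 62000], "bonus": [5000, 3100]})

result = (
    df
    .assign(total_comp=lambda x: x["salary"] + x["bonus"])
    # x here is the intermediate DataFrame, which already has total_comp.
    .assign(bonus_share=lambda x: x["bonus"] / x["total_comp"])
)

print(result)
```

Writing df["total_comp"] in the second step instead of the lambda would raise a KeyError, since df itself never gains that column.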
Mutate-in-place vs. assign
The same logical operation can be expressed two ways:
# In place — modifies df
df["total_comp"] = df["salary"] + df["bonus"]
# Functional — returns a new DataFrame
df2 = df.assign(total_comp=df["salary"] + df["bonus"])
Both are widely used. A modern convention: use .assign() inside
long method chains; use direct assignment for one-off tweaks.
When in doubt, lean toward immutability: keeping the original
df untouched makes it easier to reason about what later cells
are working with.
Mutable vs immutable thinking
Mutable operations change the DataFrame in place; immutable operations return a new one. Pandas supports both. Beginners tend to overuse mutable operations (and lose track of state); experienced analysts often prefer immutable ones, even at the cost of an extra variable.
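One way to see the difference is to inspect the column list after each style of operation. This is a small sketch on made-up data; the variable names cols_after_assign and cols_after_mutation are only there to record the state at each step.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"salary": [50000, 62000], "bonus": [5000, 3100]})

# Immutable style: df2 gains the column, df is left as it was.
df2 = df.assign(total_comp=df["salary"] + df["bonus"])
cols_after_assign = list(df.columns)

# Mutable style: df itself now carries the new column.
df["total_comp"] = df["salary"] + df["bonus"]
cols_after_mutation = list(df.columns)

print(cols_after_assign)
print(cols_after_mutation)
```

After the .assign call df still has two columns; only the direct assignment changes it.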
Conditional columns with np.where
np.where(condition, value_if_true, value_if_false) is a vectorized
ternary.
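A short sketch of a two-way flag built with np.where. The 50,000 threshold and the "high"/"low" labels are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"salary": [50000, 62000, 48000]})

# Rows where the condition holds get "high"; all others get "low".
df["band"] = np.where(df["salary"] >= 50000, "high", "low")

print(df)
```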
Multi-condition: np.select
For more than two branches, np.select is cleaner than nested
np.where calls. Conditions are checked in order, and the first
one that matches decides the value.
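A sketch of a three-way split with np.select. The cut-offs and level names are invented for the example. Passing a string default keeps the result a single string type, matching the choices.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"salary": [48000, 50000, 62000]})

# Conditions are evaluated in order; the first True one wins for each row.
conditions = [
    df["salary"] < 50000,
    df["salary"] < 60000,
]
choices = ["junior", "mid"]

# Rows matching no condition fall through to the default.
df["level"] = np.select(conditions, choices, default="senior")

print(df)
```

Because the first match wins, the second condition does not need to repeat the lower bound.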
Mapping with dictionaries: .map
When the new column is a lookup from old values:
.map(dict) is the quintessential dictionary-as-lookup pattern.
Values not in the dict become NaN.
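A small sketch of the lookup pattern, including the missing-key case. The exchange rates are placeholder numbers, not real data, and usd_rate is a hypothetical column name.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"currency": ["USD", "EUR", "GBP"]})

# Placeholder rates; "GBP" is deliberately left out of the dict.
rates = {"USD": 1.00, "EUR": 1.08}

df["usd_rate"] = df["currency"].map(rates)

print(df)
```

The "GBP" row ends up with NaN in usd_rate, so checking df["usd_rate"].isna() after a map is a quick way to catch keys the dict did not cover.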
.apply — when you really need a function
When a column's value depends on more complex logic, fall back
to .apply. It is slower than vectorized operations, but
flexible.
For one column → use Series.apply(func). For multiple
columns per row → use DataFrame.apply(func, axis=1), but
expect it to be slow on big data; prefer vectorized expressions
when you can.
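Both forms can be sketched as follows. The helper functions initials and describe, and the sample columns, are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["ana lopez", "ben okafor"],
    "dept": ["Engineering", "Operations"],
})

def initials(name: str) -> str:
    """Return the upper-cased first letter of each word in a name."""
    return "".join(part[0].upper() for part in name.split())

# One column in, one value out: Series.apply calls the function per element.
df["initials"] = df["name"].apply(initials)

def describe(row: pd.Series) -> str:
    """Build a label from two columns of the same row."""
    return f"{row['name'].title()} ({row['dept']})"

# Several columns per row: DataFrame.apply with axis=1 passes each row as a Series.
df["label"] = df.apply(describe, axis=1)

print(df)
```

The row-wise form runs a Python function call for every row, which is where the slowdown on large data comes from.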
Binning continuous values
pd.cut slices a continuous column into named buckets, using bin edges you choose.
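A minimal sketch of pd.cut with explicit edges. The edges and labels are arbitrary. By default each bin includes its right edge, so 30 would fall in the first bucket here.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"age": [22, 35, 47, 61]})

# Three bins: (0, 30], (30, 50], (50, 120]; one label per bin.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 50, 120],
    labels=["young", "mid", "senior"],
)

print(df)
```

The result is a categorical column, and values outside the outermost edges become NaN.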
For equal-frequency buckets (quartiles, quintiles) use
pd.qcut, which picks the bin edges from the data's quantiles.
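A sketch of quartile buckets with pd.qcut, on made-up salaries spaced evenly so that each bucket receives the same number of rows. The Q1 to Q4 labels are an assumption for readability.

```python
import pandas as pd

# Hypothetical sample data: eight evenly spaced salaries.
df = pd.DataFrame({
    "salary": [30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000],
})

# q=4 asks for four buckets with (roughly) equal row counts.
df["quartile"] = pd.qcut(df["salary"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(df["quartile"].value_counts().sort_index())
```

Unlike pd.cut, the edges are not supplied by you: they move with the data, so the same salary can land in a different bucket on a different dataset.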
A composite example
Put several techniques together — an HR dataset with derived fields.
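One possible version of such a composite, as a sketch only. The HR table, its column names, the department lookup, and the tenure bands are all invented for illustration and are not taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical HR data.
hr = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "dept_code": ["ENG", "OPS", "ENG", "HR"],
    "salary": [95000, 58000, 72000, 64000],
    "bonus": [9500, 2900, 0, 3200],
    "tenure_years": [7, 1, 3, 12],
})

# Lookup dict; "HR" is deliberately missing to show the NaN behavior of .map.
dept_names = {"ENG": "Engineering", "OPS": "Operations"}

result = hr.assign(
    # Arithmetic on whole columns.
    total_comp=lambda x: x["salary"] + x["bonus"],
    # Two-way conditional flag.
    has_bonus=lambda x: np.where(x["bonus"] > 0, "yes", "no"),
    # Dictionary lookup; unmapped codes become NaN.
    dept=lambda x: x["dept_code"].map(dept_names),
    # Binning a continuous value into named buckets.
    tenure_band=lambda x: pd.cut(
        x["tenure_years"],
        bins=[0, 2, 5, 50],
        labels=["new", "established", "veteran"],
    ),
)

print(result)
```

Everything happens inside one .assign call, so hr stays untouched and result carries the four derived fields.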
Check your understanding
What is the difference between df["x"] = ... and df.assign(x=...)?
Nothing
The first is faster
Direct assignment modifies df in place; .assign returns a new DataFrame and leaves the original untouched
.assign is a method on Series
When would you reach for np.select instead of np.where?
When you want to filter
When you want to drop columns
When you need to assign one of more than two values depending on multiple conditions
When you want to sort
df["currency"].map({"USD": 1.00, "EUR": 1.08}) returns NaN for rows whose currency is "GBP". Why?
Pandas does not support strings in maps
The map is corrupted
.map returns NaN for any value not found in the mapping dict; "GBP" is missing so the result is NaN
The dict is read-only