String Operations
The .str accessor — vectorized string methods that operate on whole columns of text without writing a loop.
Real-world data is full of text: names, addresses, product
codes, comments, log messages. Pandas exposes nearly all of
Python's string methods on whole columns via the .str
accessor.
The basic shape
Any text column has a .str namespace. Methods you call on it
behave element-wise.
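As a minimal sketch of that shape (the Series `names` and its values are made up for illustration), each call below runs once per element and returns a new Series of the same length:

```python
import pandas as pd

# Hypothetical sample data, not from the original lesson.
names = pd.Series(["alice", "Bob", "CHARLIE"])

upper = names.str.upper()             # uppercase every element
lengths = names.str.len()             # number of characters per element
starts_b = names.str.startswith("B")  # boolean result per element

print(upper.tolist())     # ['ALICE', 'BOB', 'CHARLIE']
print(lengths.tolist())   # [5, 3, 7]
print(starts_b.tolist())  # [False, True, False]
```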
Chaining string methods
These compose naturally:
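A short sketch of a chain, using an invented Series `raw`. Each `.str` call returns a new Series, so the next `.str` call can follow directly:

```python
import pandas as pd

# Hypothetical sample data for illustration.
raw = pd.Series(["  Alice SMITH ", "bob jones", " Carol Lee"])

tidy = (
    raw.str.strip()                           # drop leading/trailing whitespace
       .str.lower()                           # normalize case
       .str.replace(" ", "_", regex=False)    # literal space -> underscore
)

print(tidy.tolist())  # ['alice_smith', 'bob_jones', 'carol_lee']
```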
Search and contains
.contains returns a boolean mask — perfect for filters:
By default, .contains treats the pattern as a regular
expression. Pass regex=False for literal substring matching,
which is faster and safer when the pattern contains regex
metacharacters such as ., ( or +.
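The sketch below shows both modes on an invented `products` Series. The literal search for "(" is the case where `regex=False` matters, since a lone "(" is not a valid regular expression:

```python
import pandas as pd

# Hypothetical sample data for illustration.
products = pd.Series(["Widget (large)", "Gadget", "widget mini", "Gizmo (small)"])

# Regex mode (the default), made case-insensitive with case=False.
has_widget = products.str.contains("widget", case=False)

# Literal mode: "(" is matched as a plain character.
has_paren = products.str.contains("(", regex=False)

# The boolean mask can be used directly as a filter.
widgets = products[has_widget]

print(has_widget.tolist())  # [True, False, True, False]
print(has_paren.tolist())   # [True, False, False, True]
print(widgets.tolist())     # ['Widget (large)', 'widget mini']
```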
Split, slice, and extract
.split(..., expand=True) is one of the most useful patterns —
it lets you turn one column into many in a single call.
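For illustration, here is a sketch with a made-up `emails` Series. The column names `user` and `domain` are my own choice, since `expand=True` labels the new columns 0, 1, ... by default:

```python
import pandas as pd

# Hypothetical sample data for illustration.
emails = pd.Series(["ann@example.com", "bob@test.org"])

# expand=True returns a DataFrame with one column per split piece.
parts = emails.str.split("@", expand=True)
parts.columns = ["user", "domain"]  # rename the default 0/1 labels

# Slicing works element-wise too: first three characters of each value.
first3 = emails.str[:3]

print(parts["user"].tolist())    # ['ann', 'bob']
print(parts["domain"].tolist())  # ['example.com', 'test.org']
print(first3.tolist())           # ['ann', 'bob']
```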
Extract with regex
When the pattern is more structured, str.extract pulls each
regex capture group out into its own column.
Named groups become column names. This is hugely useful for turning log files into a real DataFrame.
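A small sketch of the log-file case, assuming an invented log format of "date LEVEL message". The group names `date`, `level`, and `message` are illustrative:

```python
import pandas as pd

# Hypothetical log lines; the format is an assumption for this example.
logs = pd.Series([
    "2024-01-15 ERROR disk full",
    "2024-01-16 INFO started",
])

# Each (?P<name>...) group becomes a column called `name`.
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)"
parsed = logs.str.extract(pattern)

print(list(parsed.columns))      # ['date', 'level', 'message']
print(parsed["level"].tolist())  # ['ERROR', 'INFO']
```

Rows that do not match the pattern come back as missing values in every column.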
Replace — substrings and regex
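A minimal sketch of both modes, using an invented `phones` Series. Passing `regex=` explicitly is a deliberate choice here, because the default has changed between pandas versions:

```python
import pandas as pd

# Hypothetical sample data for illustration.
phones = pd.Series(["(555) 123-4567", "555.123.4567"])

# Literal replacement: "." is a plain character, not "any character".
no_dots = phones.str.replace(".", "-", regex=False)

# Regex replacement: \D matches every non-digit character.
digits = phones.str.replace(r"\D", "", regex=True)

print(no_dots.tolist())  # ['(555) 123-4567', '555-123-4567']
print(digits.tolist())   # ['5551234567', '5551234567']
```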
Padding, justifying, and case
A grab bag of useful methods:
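A few of these in action, as a sketch on invented data (`ids` and `people` are hypothetical names):

```python
import pandas as pd

# Hypothetical sample data for illustration.
ids = pd.Series(["7", "42", "123"])
people = pd.Series(["ada LOVELACE", "GRACE hopper"])

zero_padded = ids.str.zfill(5)               # left-pad with zeros
right_just = ids.str.rjust(5, fillchar=".")  # right-justify with a fill char
titled = people.str.title()                  # Title Case
shouted = people.str.upper()                 # UPPER CASE

print(zero_padded.tolist())  # ['00007', '00042', '00123']
print(right_just.tolist())   # ['....7', '...42', '..123']
print(titled.tolist())       # ['Ada Lovelace', 'Grace Hopper']
```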
Length, count, and matches
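The three ideas in this heading can be sketched as follows, on a made-up `words` Series. Note that `.str.count` takes a regex pattern and `.str.match` only tests from the start of each string:

```python
import pandas as pd

# Hypothetical sample data for illustration.
words = pd.Series(["banana", "apple", "kiwi"])

lengths = words.str.len()                   # characters per element
a_counts = words.str.count("a")             # occurrences of the pattern "a"
starts_vowel = words.str.match(r"[aeiou]")  # regex anchored at the start

print(lengths.tolist())       # [6, 5, 4]
print(a_counts.tolist())      # [3, 1, 0]
print(starts_vowel.tolist())  # [False, True, False]
```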
A small cleaning example
Bringing several pieces together — normalizing employee names:
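One way this could look, as a sketch only: the `employees` DataFrame and its "Last, First" input format are assumptions made for the example, not the lesson's original data.

```python
import pandas as pd

# Hypothetical data: names stored as "Last, First" with messy spacing and case.
employees = pd.DataFrame({"name": ["  SMITH, john ", "doe,  JANE", "Lee, Ann"]})

# Trim whitespace and normalize case first.
clean = employees["name"].str.strip().str.title()

# Split on a comma plus any following whitespace (regex= needs pandas >= 1.4).
parts = clean.str.split(r",\s*", expand=True, regex=True)

# Reassemble as "First Last".
employees["full_name"] = parts[1] + " " + parts[0]

print(employees["full_name"].tolist())  # ['John Smith', 'Jane Doe', 'Ann Lee']
```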
Working with NaN in string columns
By default, string methods propagate NaN — a NaN input yields
a NaN output. Boolean methods (contains, startswith) can
optionally fill NaN with True/False using na=.
Passing na=False is essential when the result of .contains is
used as a boolean filter. With the default, the mask can
contain NaN, and indexing with it raises an error because NaN
is not a valid boolean value.
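A sketch of the safe pattern on an invented `comments` Series. The explicit `dtype=object` is my assumption to keep the NaN-propagating behavior consistent across pandas versions, since newer string dtypes may handle missing values differently:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; dtype=object is an assumption (see lead-in).
comments = pd.Series(["great product", np.nan, "not great", "ok"], dtype=object)

# Default: the missing value stays missing in the result.
mask_default = comments.str.contains("great")

# na=False: missing values count as "no match", so the mask is purely boolean.
mask_safe = comments.str.contains("great", na=False)
filtered = comments[mask_safe]

print(mask_safe.tolist())  # [True, False, True, False]
print(filtered.tolist())   # ['great product', 'not great']
```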
Cleaning a messy column end-to-end
Given a Series titles with messy values, produce a Series clean where:
- Leading/trailing whitespace is removed.
- Multiple internal spaces are collapsed to a single space.
- The result is title-cased.
- Empty strings become pd.NA.
- The strings "data scientist" and "data scientists" (any casing) both become "Data Scientist".
Use vectorized .str methods.
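One possible approach, offered as a sketch rather than the reference solution. The sample `titles` values are invented to exercise each rule:

```python
import pandas as pd

# Hypothetical messy input covering each requirement.
titles = pd.Series(["  data   scientist ", "DATA SCIENTISTS", "", " senior  engineer", "   "])

clean = (
    titles.str.strip()                          # trim the ends
          .str.replace(r"\s+", " ", regex=True)  # collapse internal whitespace
          .str.title()                           # title-case
          # After title-casing, singular and plural differ only by a final "s".
          .str.replace(r"^Data Scientists?$", "Data Scientist", regex=True)
)

# Turn empty strings (including whitespace-only inputs) into missing values.
clean = clean.mask(clean == "", pd.NA)

print(clean.tolist())
```

Doing the empty-string step last means whitespace-only inputs, which become empty after stripping, are also treated as missing.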
Check your understanding
Why use the .str accessor at all?
- Pandas requires it
- It is faster than Python loops
- It provides vectorized string operations on whole columns of text: clean syntax and far faster than writing a Python for-loop over each value
- It works only on integers

Calling s.str.contains("foo") on a column with a NaN value will:
- Return True everywhere
- Return False everywhere
- Propagate the NaN: the result has NaN at that position by default
- Throw an error

What is the practical advantage of str.split("@", expand=True)?
- It is faster than not expanding
- It returns a Series of lists
- It returns a DataFrame, one column per split piece, which is ideal for splitting one column into many
- It removes the @ sign