
String Operations

The .str accessor — vectorized string methods that operate on whole columns of text without writing a loop.

Real-world data is full of text: names, addresses, product codes, comments, log messages. Pandas exposes nearly all of Python's string methods on whole columns via the .str accessor.

The basic shape

Any text column has a .str namespace. Methods you call on it behave element-wise.
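As a minimal sketch of that shape, the example below calls two `.str` methods on a small Series. The sample values and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original page.
names = pd.Series(["  alice ", "BOB", "Carol"])

# Each method is applied to every element and returns a new Series.
upper = names.str.upper()       # upper-case every value
stripped = names.str.strip()    # drop leading/trailing whitespace

print(upper.tolist())      # ['  ALICE ', 'BOB', 'CAROL']
print(stripped.tolist())   # ['alice', 'BOB', 'Carol']
```

The original Series is left unchanged; assign the result back to keep it.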


Chaining string methods

These compose naturally:
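A short sketch of chaining, assuming a hypothetical Series of untidy names. Each `.str` call returns a Series, so the next `.str` call can follow directly.

```python
import pandas as pd

# Hypothetical sample data for illustration.
names = pd.Series(["  alice smith ", "BOB JONES", "carol  "])

# strip -> lower -> title, applied element-wise in sequence.
clean = names.str.strip().str.lower().str.title()

print(clean.tolist())   # ['Alice Smith', 'Bob Jones', 'Carol']
```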


Search and contains

.str.contains returns a boolean mask, which makes it well suited to filtering rows:


By default, .str.contains treats the pattern as a regular expression. Pass regex=False for literal substring matching, which is faster and safer when the pattern contains regex metacharacters such as ( or .
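The sketch below shows both modes on a hypothetical product column: a regex-mode search used as a row filter, and a literal search for a character that has special meaning in regex.

```python
import pandas as pd

# Hypothetical sample data for illustration.
products = pd.Series(["apple pie", "banana split", "apple (green)", "cherry"])

# Boolean mask, then use it to filter the Series.
has_apple = products.str.contains("apple")
apples = products[has_apple]

# "(" on its own is not a valid regex, so search for it literally.
has_paren = products.str.contains("(", regex=False)

print(apples.tolist())      # ['apple pie', 'apple (green)']
print(has_paren.tolist())   # [False, False, True, False]
```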

Split, slice, and extract


.split(..., expand=True) is one of the most useful patterns — it lets you turn one column into many in a single call.
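A minimal sketch of splitting and slicing, assuming a hypothetical column of email addresses. The column names `user` and `domain` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration.
emails = pd.Series(["ann@example.com", "bob@test.org"])

# Without expand, each element becomes a list of pieces.
pieces = emails.str.split("@")

# With expand=True, the pieces become columns of a DataFrame.
parts = emails.str.split("@", expand=True)
parts.columns = ["user", "domain"]

# .str also supports slicing, applied to every element.
first3 = emails.str[:3]

print(parts)
print(first3.tolist())   # ['ann', 'bob']
```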

Extract with regex

When the pattern is more structured, str.extract pulls out the capture groups of a regular expression, one column per group.


Named groups become column names. This is hugely useful for turning log files into a real DataFrame.
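As a sketch, the example below parses two made-up log lines. The log format and the group names `date`, `level`, and `message` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical log lines; the format is an assumption for this example.
logs = pd.Series([
    "2024-01-15 ERROR disk full",
    "2024-01-16 INFO started",
])

# Each (?P<name>...) group becomes a column called `name`.
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)"
parsed = logs.str.extract(pattern)

print(parsed)
```

Rows that do not match the pattern come back as missing values in every column.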

Replace — substrings and regex
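A minimal sketch of both kinds of replacement, using hypothetical phone numbers. The regex argument is passed explicitly so the behavior does not depend on the default, which changed in pandas 2.0.

```python
import pandas as pd

# Hypothetical sample data for illustration.
phones = pd.Series(["555-1234", "555.9876", "(555) 4321"])

# Literal replacement: "." is treated as a plain character.
dashed = phones.str.replace(".", "-", regex=False)

# Regex replacement: \D matches any non-digit character.
digits = phones.str.replace(r"\D", "", regex=True)

print(dashed.tolist())   # ['555-1234', '555-9876', '(555) 4321']
print(digits.tolist())   # ['5551234', '5559876', '5554321']
```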


Padding, justifying, and case

A grab bag of useful methods:
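The sketch below picks a few of these methods as examples; the selection and the sample values are illustrative, not the page's original list.

```python
import pandas as pd

# Hypothetical sample data for illustration.
codes = pd.Series(["7", "42", "123"])
words = pd.Series(["hello world", "PANDAS"])

zero_padded = codes.str.zfill(5)                # pad with zeros on the left
right = codes.str.rjust(5, fillchar="*")        # right-justify in width 5
left = codes.str.ljust(5, fillchar="-")         # left-justify in width 5

capitalized = words.str.capitalize()            # first letter upper, rest lower
swapped = words.str.swapcase()                  # flip the case of each letter

print(zero_padded.tolist())   # ['00007', '00042', '00123']
print(right.tolist())         # ['****7', '***42', '**123']
print(capitalized.tolist())   # ['Hello world', 'Pandas']
```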


Length, count, and matches
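A short sketch of these three kinds of method on hypothetical data. Note that `str.match` only requires the pattern to match at the start of each string, while `str.fullmatch` requires it to match the whole string.

```python
import pandas as pd

# Hypothetical sample data for illustration.
words = pd.Series(["banana", "apple", "kiwi"])

lengths = words.str.len()                        # characters per value
a_counts = words.str.count("a")                  # occurrences of "a"
starts_vowel = words.str.match(r"[aeiou]")       # anchored at the start
four_letters = words.str.fullmatch(r"[a-z]{4}")  # whole string must match

print(lengths.tolist())        # [6, 5, 4]
print(a_counts.tolist())       # [3, 1, 0]
print(starts_vowel.tolist())   # [False, True, False]
print(four_letters.tolist())   # [False, False, True]
```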


A small cleaning example

Bringing several pieces together — normalizing employee names:
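One way this could look, assuming the raw names arrive as "LAST, first" with inconsistent spacing and casing. The input format and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw values in "last, first" order.
raw = pd.Series(["  SMITH, john ", "doe,JANE", "Lee ,  amy"])

# Split on the comma into two columns: 0 = last name, 1 = first name.
parts = raw.str.strip().str.split(",", expand=True)

# Tidy each piece, then reassemble as "First Last".
last = parts[0].str.strip().str.title()
first = parts[1].str.strip().str.title()
full_name = first + " " + last

print(full_name.tolist())   # ['John Smith', 'Jane Doe', 'Amy Lee']
```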


Working with NaN in string columns

By default, string methods propagate NaN — a NaN input yields a NaN output. Boolean methods (contains, startswith) can optionally fill NaN with True/False using na=.
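A minimal sketch with one missing value in a hypothetical Series. It shows the missing value passing through `str.upper`, and `na=False` producing a clean boolean mask for filtering.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
s = pd.Series(["foo bar", None, "baz"])

# The missing value stays missing after a string method.
upper = s.str.upper()

# na=False treats the missing value as "no match".
mask = s.str.contains("foo", na=False)
filtered = s[mask]

print(upper.isna().tolist())   # [False, True, False]
print(mask.tolist())           # [True, False, False]
print(filtered.tolist())       # ['foo bar']
```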


Passing na=False is essential when using .str.contains inside a boolean filter. Without it, the mask can contain NaN rather than only True/False, and pandas raises an error when you index with it.

Cleaning a messy column end-to-end

Challenge: Normalize a messy job title column

Given a Series titles with messy values, produce a Series clean where:

  1. Leading/trailing whitespace is removed.
  2. Multiple internal spaces are collapsed to a single space.
  3. The result is title-cased.
  4. Empty strings become pd.NA.
  5. The strings "data scientist" and "data scientists" (any casing) both become "Data Scientist".

Use vectorized .str methods.
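One possible solution sketch, not necessarily the intended one. The sample values in `titles` are made up, and because title-casing runs before the special case, only "Data Scientists" needs to be rewritten.

```python
import pandas as pd

# Hypothetical messy input for illustration.
titles = pd.Series([
    "  data   scientist ",
    "DATA SCIENTISTS",
    "",
    "software  engineer",
    "   ",
])

clean = (
    titles.str.strip()                               # 1. trim the ends
          .str.replace(r"\s+", " ", regex=True)      # 2. collapse inner spaces
          .str.title()                               # 3. title-case
          .str.replace(r"^Data Scientists$",         # 5. unify the plural form
                       "Data Scientist", regex=True)
)

# 4. Empty strings become missing values.
clean = clean.mask(clean == "", pd.NA)

print(clean.tolist())
```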

Check your understanding

Question (select one)

Why use the .str accessor at all?

Pandas requires it

It is faster than Python loops

It provides vectorized string operations on whole columns of text — clean syntax and far faster than writing a Python for-loop over each value

It works only on integers

Question (select one)

Calling s.str.contains("foo") on a column with a NaN value will:

Return True everywhere

Return False everywhere

Propagate the NaN — the result has NaN at that position by default

Throw an error

Question (select one)

What is the practical advantage of str.split("@", expand=True)?

It is faster than not expanding

It returns a Series of lists

It returns a DataFrame, one column per split piece — perfect for splitting one column into many

It removes the @ sign
