The Anatomy of an NLP Pipeline
The standard sequence of preprocessing steps text flows through — tokenize, normalize, filter, reduce, analyze — seen as one picture. Why it is a pipeline of transformations, why most steps are optional, and why order matters.
On the last page we said NLP adds structure to text in layers. Those layers are usually arranged as a pipeline: a fixed sequence of steps where the output of each step becomes the input to the next. Before we spend a page each on the individual steps, it is worth seeing the whole assembly line at once, because the shape will repeat on nearly every page that follows.
The standard text preprocessing pipeline
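Here is the canonical flow. Almost every classic NLP project is some subset of these steps, in roughly this order:
- Tokenize: split the raw string into a list of tokens.
- Normalize: make tokens comparable, for example by lowercasing.
- Filter: drop tokens you do not want, such as punctuation and stopwords.
- Reduce: replace each token with its root by stemming or lemmatizing.
- Analyze: count, build n-grams, or build features for a model.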
Read it top to bottom. Raw text comes in; a clean, normalized list of units comes out, ready to be counted or fed into a model. Three things about this picture matter more than the picture itself.
First, it is a pipeline of transformations. At each step you are holding
a value — first a string, then a list of tokens — and you apply one
transformation to get the next value. This is exactly the list-comprehension
style of Python you already know: tokens = [t.lower() for t in tokens] is
the case-folding step.
Second, most steps are optional. Only tokenization is nearly always present. Whether you lowercase, whether you remove stopwords, whether you stem or lemmatize — those are choices, and making them well is the single most important skill in classic NLP. We give that skill its own page later (Designing the Right Pipeline).
Third, order matters. Removing punctuation before tokenizing can destroy information a tokenizer needs (those periods help it find sentence ends). Lowercasing before part-of-speech tagging can confuse a tagger that uses capitalization as a clue. The same steps in a different order can give different — sometimes worse — results.
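The punctuation case can be sketched in a few lines. This is a minimal illustration, not a real sentence tokenizer: `split_sentences` and `strip_punctuation` are hypothetical helpers, and the regex splitter is a toy stand-in that only looks for `.`, `!` or `?` followed by whitespace.

```python
import re
import string

text = "It rained all day. We stayed in. Nobody minded."

def split_sentences(s):
    # Toy splitter (an assumption for this sketch): break after ., ! or ?
    # when it is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", s.strip())

def strip_punctuation(s):
    # Delete every ASCII punctuation character from the string.
    return s.translate(str.maketrans("", "", string.punctuation))

# Order 1: find sentence boundaries first, then strip punctuation.
split_then_strip = [strip_punctuation(s) for s in split_sentences(text)]

# Order 2: strip punctuation first, then look for sentence boundaries.
strip_then_split = split_sentences(strip_punctuation(text))

print(len(split_then_strip), split_then_strip)
print(len(strip_then_split), strip_then_split)
```

The first order finds three sentences. The second finds one, because the periods the splitter relies on were already gone by the time it ran.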
The pipeline is a sequence of transformations on a list
After tokenization, your text is just a Python list of strings, and every later step is a transformation of that list into another list: filter out some items, lowercase the rest, replace each with its root. If you are comfortable with list comprehensions, you already understand the mechanics of an NLP pipeline. The hard part is never the code — it is deciding which transformations to apply.
Watching a list flow through the pipeline
Let us make "a pipeline is a sequence of list transformations" concrete. The block below runs the standard steps and prints the list after each one, so you can watch it shrink and change. Read each printed line against the diagram above.
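A minimal sketch of such a walk-through follows. It is not the course's own block: the regex tokenizer stands in for `word_tokenize`, `STOPWORDS` is a tiny hand-written set rather than a real stopword list, and `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer.

```python
import re

text = "The cats are walking quickly through the gardens."

# Stand-in stopword set (an assumption; real lists are much longer).
STOPWORDS = {"the", "are", "through", "a", "an", "is", "of", "in"}

def toy_stem(word):
    # Crude suffix stripping for illustration only; not a real stemmer.
    for suffix in ("ing", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Step 1: tokenize into words and single punctuation characters.
tokens = re.findall(r"\w+|[^\w\s]", text)
print("1 tokenize :", tokens)

# Step 2: lowercase every token.
tokens = [t.lower() for t in tokens]
print("2 lowercase:", tokens)

# Step 3: keep only fully alphabetic tokens (drops the final period).
tokens = [t for t in tokens if t.isalpha()]
print("3 alphabetic:", tokens)

# Step 4: remove stopwords.
tokens = [t for t in tokens if t not in STOPWORDS]
print("4 stopwords:", tokens)

# Step 5: reduce each token to a rough root.
tokens = [toy_stem(t) for t in tokens]
print("5 reduce   :", tokens)
```

Every step after the first is a single list comprehension that rebinds `tokens`, so commenting out or moving a step is a one-line change.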
Each print is one row of the pipeline diagram. The list started as nine
tokens including punctuation and shrank to a handful of meaningful roots. Try
commenting out step 4 (stopword removal) and re-running — notice how much
longer the final list becomes, and ask yourself whether those extra words
would help or hurt whatever you planned to do next. That question has no
universal answer, which is exactly why these steps are choices.
Experiment: reorder the steps
Move the lowercasing step (step 2) to after stopword removal and re-run.
The stopword list is all lowercase, so comparing un-lowercased tokens like
"The" against it silently fails to remove them. This is a tiny taste of why
order matters — a theme we will keep returning to.
The two families of steps: cleaning vs. analysis
It helps to mentally split the pipeline into two halves.
The preprocessing half turns raw text into clean, comparable units. It is mostly mechanical and mostly the same from project to project. The analysis half is where you actually extract value, and it depends on your goal: a search engine counts and indexes, an autocomplete builds n-grams, a classifier builds features. Crucially, the analysis you intend to do should drive the preprocessing choices you make — not the other way around. We will see this dependency again and again.
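To make the analysis half concrete, here is a small sketch of the three goals mentioned above applied to the same token list. The token list and the `has(...)` feature naming are made up for illustration; real systems use far richer representations.

```python
from collections import Counter

# Assume these tokens came out of the cleaning half of a pipeline.
tokens = ["new", "york", "pizza", "new", "york", "bagel",
          "new", "jersey", "pizza"]

# Search-engine style: count how often each token occurs.
counts = Counter(tokens)

# Autocomplete style: count adjacent pairs of tokens (bigrams).
bigrams = Counter(zip(tokens, tokens[1:]))

# Classifier style: a simple presence feature for each distinct token.
features = {f"has({t})": True for t in set(tokens)}

print(counts["new"])
print(bigrams[("new", "york")])
print(sorted(features))
```

Notice that the bigram counts only make sense if word order survived preprocessing, while the presence features ignore order entirely. That is one small way the intended analysis constrains the cleaning choices.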
In the pipeline, why is it accurate to say "most steps are optional, but tokenization usually is not"?
Tokenization is the only step that is fast enough to run
Almost every later step operates on tokens, so you need tokens before you can lowercase, filter, count, or tag them — whereas lowercasing, stopword removal, and stemming are choices that depend on the task
Tokenization is the last thing you do, after counting
Stopword removal must always happen before tokenization
Order matters: a concrete example
We just hinted at it; let us prove it. The block below runs the same three operations — lowercase, remove stopwords, keep alphabetic — but in two different orders, and compares the results.
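A minimal sketch of that comparison, under stated assumptions: the sentence, the two-word `STOPWORDS` set, and the regex tokenizer are stand-ins chosen for this illustration, and in this version Order B lowercases after filtering, so the surviving tokens appear in the output as lowercase "the".

```python
import re

text = "The dog chased the cat, and THE cat ran!"

# Stand-in stopword set (an assumption); all lowercase, like real lists.
STOPWORDS = {"the", "and"}

tokens = re.findall(r"\w+|[^\w\s]", text)

def lowercase(ts):
    return [t.lower() for t in ts]

def remove_stopwords(ts):
    return [t for t in ts if t not in STOPWORDS]

def keep_alpha(ts):
    return [t for t in ts if t.isalpha()]

# Order A: lowercase, then remove stopwords, then keep alphabetic.
order_a = keep_alpha(remove_stopwords(lowercase(tokens)))

# Order B: remove stopwords, then lowercase, then keep alphabetic.
order_b = keep_alpha(lowercase(remove_stopwords(tokens)))

print("Order A:", order_a)
print("Order B:", order_b)
print("Same result?", order_a == order_b)
```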
Order A correctly removes all three "the"s, because by the time we compare
against the lowercase stopword list, every token is lowercase. Order B leaves
"The" and "THE" behind, because they did not match the lowercase "the"
in the stoplist at the moment we filtered. Same operations, different order,
different — and in this case wrong — result.
Order bugs are silent
Notice that Order B did not crash. It produced a perfectly reasonable-looking list that happened to be wrong. Pipeline ordering bugs almost never raise an error; they quietly corrupt your data and you only notice when your downstream results look off. Reasoning carefully about order is a real part of the craft.
Your turn: assemble a mini pipeline
Time to build the cleaning half yourself. The challenge below asks you to
write a preprocess function that runs the standard cleaning steps in the
right order. The hidden tests check the output on a couple of sentences,
including one where a stopword changes everything.
Write a function preprocess(text) that returns a list of clean tokens by
running these steps in this order:
- Word-tokenize the text with word_tokenize.
- Lowercase every token.
- Keep only tokens that are fully alphabetic (use .isalpha() to drop punctuation and numbers).
- Remove English stopwords (use the provided stop set).
Return the resulting list. For example, preprocess("The quick brown foxes!") should return ['quick', 'brown', 'foxes'].
The word_tokenize function and the stop set are already available from
the setup section.
Notice what the second test taught you
preprocess("It is NOT a good day.") returned ['good', 'day']. The word
"not" vanished — it is on NLTK's stopword list — and with it, the entire
meaning of the sentence flipped from negative to positive. That is a preview
of one of the most important cautionary tales in this course, which we will
tell in full on the stopwords page. Removing stopwords is not
free; sometimes it throws away the very words that matter.
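The effect is easy to reproduce in isolation. The sketch below is one possible version of the function with stand-ins, not the reference solution: a regex tokenizer replaces `word_tokenize`, and `STOPWORDS` is a small hand-written set that, like the list described above, includes "not".

```python
import re

# Stand-in stopword set (an assumption for this sketch).
STOPWORDS = {"it", "is", "not", "a", "the"}

def preprocess(text):
    # Tokenize, lowercase, keep alphabetic tokens, then remove stopwords.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t.isalpha()]
    return [t for t in tokens if t not in STOPWORDS]

negative = preprocess("It is NOT a good day.")
positive = preprocess("It is a good day.")

print(negative)
print(positive)
print("Indistinguishable?", negative == positive)
```

After cleaning, the two sentences produce identical token lists, so nothing downstream can tell them apart.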
Check your understanding
Why can applying the same preprocessing steps in a different order produce different results?
It cannot — order never affects the outcome of preprocessing
Each step transforms the data the next step sees, so (for example) filtering before lowercasing compares differently-cased tokens against a lowercase stopword list and misses some
Reordering steps changes which version of Python runs
Only the first step has any effect; the rest are ignored
A colleague says, "I always remove stopwords and stem every text, no matter the project — it's just good hygiene." What is the best response?
They are right; more preprocessing is always better
Preprocessing steps are choices that should be driven by the downstream task; some steps that help one task (like topic counting) actively hurt another (like sentiment analysis)
They are right, because stemming and stopword removal are required by NLTK
They are wrong because preprocessing should never be done at all
Thinking of the pipeline as "a sequence of transformations on a list of tokens" is useful mainly because:
It proves that NLP requires no programming
Once text is tokenized, each later step is just a filter or map over a list — a pattern you already know — so the difficulty shifts from coding the steps to choosing them
It means the order of operations is irrelevant
It guarantees every pipeline gives the same output
With the whole assembly line in view, we can now study each station properly. We start where the pipeline starts: tokenization — and why breaking text into words and sentences is far subtler than it looks.