Dataslope

The Anatomy of an NLP Pipeline

The standard sequence of preprocessing steps text flows through — tokenize, normalize, filter, reduce, analyze — seen as one picture. Why it is a pipeline of transformations, why most steps are optional, and why order matters.

On the last page we said NLP adds structure to text in layers. Those layers are usually arranged as a pipeline: a fixed sequence of steps where the output of each step becomes the input to the next. Before we spend a page each on the individual steps, it is worth seeing the whole assembly line at once, because the shape will repeat on nearly every page that follows.

The standard text preprocessing pipeline

Here is the canonical flow. Almost every classic NLP project is some subset of these steps, in roughly this order.

Read it top to bottom. Raw text comes in; a clean, normalized list of units comes out, ready to be counted or fed into a model. Three things about this picture matter more than the picture itself.

First, it is a pipeline of transformations. At each step you are holding a value (first a string, then a list of tokens), and you apply one transformation to get the next value. This is exactly the list-comprehension style of Python you already know: tokens = [t.lower() for t in tokens] is the lowercasing step.

Second, most steps are optional. Only tokenization is nearly always present. Whether you lowercase, whether you remove stopwords, whether you stem or lemmatize — those are choices, and making them well is the single most important skill in classic NLP. We give that skill its own page later (Designing the Right Pipeline).

Third, order matters. Removing punctuation before tokenizing can destroy information a tokenizer needs (those periods help it find sentence ends). Lowercasing before part-of-speech tagging can confuse a tagger that uses capitalization as a clue. The same steps in a different order can give different — sometimes worse — results.
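A minimal sketch of the punctuation example, using only the standard library. The helpers split_sentences and strip_punctuation are hypothetical stand-ins written for this illustration, not the tokenizer used elsewhere in the course; a real sentence tokenizer handles abbreviations and other cases this naive regex ignores.

```python
import re

text = "It rained. We stayed home. Nobody minded."

def split_sentences(s):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [part for part in re.split(r"(?<=[.!?])\s+", s) if part]

def strip_punctuation(s):
    # Drop every character that is not a word character or whitespace
    return re.sub(r"[^\w\s]", "", s)

# Order 1: find sentences first, then remove punctuation from each one
split_then_strip = [strip_punctuation(s) for s in split_sentences(text)]

# Order 2: remove punctuation first, then try to find sentences
strip_then_split = split_sentences(strip_punctuation(text))

print(split_then_strip)  # ['It rained', 'We stayed home', 'Nobody minded']
print(strip_then_split)  # ['It rained We stayed home Nobody minded']
```

With the periods gone, the splitter has nothing to anchor on, so the second order collapses three sentences into one.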

The pipeline is a sequence of transformations on a list

After tokenization, your text is just a Python list of strings, and every later step is a transformation of that list into another list: filter out some items, lowercase the rest, replace each with its root. If you are comfortable with list comprehensions, you already understand the mechanics of an NLP pipeline. The hard part is never the code — it is deciding which transformations to apply.
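One way to picture this, as a rough sketch: write each step as a function that takes a list and returns a list, then apply the functions in sequence. The names lowercase, keep_alphabetic, and run_pipeline are made up for this illustration and do not come from any library.

```python
def lowercase(tokens):
    # Map: transform every token
    return [t.lower() for t in tokens]

def keep_alphabetic(tokens):
    # Filter: keep only tokens made entirely of letters
    return [t for t in tokens if t.isalpha()]

def run_pipeline(tokens, steps):
    # Feed the output of each step into the next one
    for step in steps:
        tokens = step(tokens)
    return tokens

tokens = ["The", "Cat", "sat", "!"]

full = run_pipeline(tokens, [lowercase, keep_alphabetic])
no_lowercase = run_pipeline(tokens, [keep_alphabetic])

print(full)          # ['the', 'cat', 'sat']
print(no_lowercase)  # ['The', 'Cat', 'sat']
```

Because the steps are just items in a list, making a step optional means leaving it out of the list, and reordering the pipeline means reordering the list.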

Watching a list flow through the pipeline

Let us make "a pipeline is a sequence of list transformations" concrete. The block below runs the standard steps and prints the list after each one, so you can watch it shrink and change. Read each printed line against the diagram above.

[Code block: Python 3.13.2]
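A minimal sketch of such a walk-through, using only the standard library so it runs without any downloads. Everything here is an assumption made for illustration: the example sentence, the regex tokenizer, the tiny STOPWORDS set, and the toy_stem helper, which is a crude suffix stripper rather than a real stemmer.

```python
import re

# Assumption: a tiny hand-written stopword set standing in for a real list
STOPWORDS = {"the", "were", "a", "an", "is", "are", "and"}

def toy_stem(word):
    # Crude suffix stripping for illustration only; not a real stemmer
    for suffix in ("ing", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "The cats were chasing the mice, quickly!"

# Step 1: tokenize into words and single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)
print("1 tokenize: ", tokens)

# Step 2: lowercase every token
tokens = [t.lower() for t in tokens]
print("2 lowercase:", tokens)

# Step 3: keep only alphabetic tokens (drops punctuation and numbers)
tokens = [t for t in tokens if t.isalpha()]
print("3 alpha:    ", tokens)

# Step 4: remove stopwords
tokens = [t for t in tokens if t not in STOPWORDS]
print("4 stopwords:", tokens)

# Step 5: reduce each token to a rough root
tokens = [toy_stem(t) for t in tokens]
print("5 stem:     ", tokens)  # ['cat', 'chas', 'mice', 'quick']
```

Note that the toy stemmer turns "chasing" into "chas" and leaves "mice" alone, which is typical of suffix stripping: the roots are consistent, not necessarily dictionary words.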

Each print is one row of the pipeline diagram. The list started as nine tokens including punctuation and shrank to a handful of meaningful roots. Try commenting out step 4 (stopword removal) and re-running — notice how much longer the final list becomes, and ask yourself whether those extra words would help or hurt whatever you planned to do next. That question has no universal answer, which is exactly why these steps are choices.

Experiment: reorder the steps

Move the lowercasing step (step 2) to after stopword removal and re-run. The stopword list is all lowercase, so un-lowercased tokens like "The" never match it and are silently left in. This is a tiny taste of why order matters, a theme we will keep returning to.

The two families of steps: cleaning vs. analysis

It helps to mentally split the pipeline into two halves.

The preprocessing half turns raw text into clean, comparable units. It is mostly mechanical and mostly the same from project to project. The analysis half is where you actually extract value, and it depends on your goal: a search engine counts and indexes, an autocomplete builds n-grams, a classifier builds features. Crucially, the analysis you intend to do should drive the preprocessing choices you make — not the other way around. We will see this dependency again and again.
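To give a flavor of the analysis half, here is a small sketch of two of the operations mentioned above, counting and building n-grams (bigrams, in this case), on an already-cleaned token list. The token list is invented for illustration.

```python
from collections import Counter

# Assumption: these tokens have already been through the cleaning half
tokens = ["new", "york", "pizza", "new", "york", "bagel"]

# Counting: how often does each token occur?
word_counts = Counter(tokens)

# Bigrams: pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

print(word_counts["new"])            # 2
print(bigram_counts.most_common(1))  # [(('new', 'york'), 2)]
```

Had the cleaning half removed or altered any of these tokens, both the counts and the bigrams would change, which is why the intended analysis should shape the preprocessing.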

Question (select one)

In the pipeline, why is it accurate to say "most steps are optional, but tokenization usually is not"?

Tokenization is the only step that is fast enough to run

Almost every later step operates on tokens, so you need tokens before you can lowercase, filter, count, or tag them — whereas lowercasing, stopword removal, and stemming are choices that depend on the task

Tokenization is the last thing you do, after counting

Stopword removal must always happen before tokenization

Order matters: a concrete example

We just hinted at it; let us prove it. The block below runs the same three operations — lowercase, remove stopwords, keep alphabetic — but in two different orders, and compares the results.

[Code block: Python 3.13.2]
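A minimal sketch of such a comparison, assuming a hand-written token list and a two-word STOPWORDS set chosen for illustration. The three operations are identical in both orders; only their sequence differs.

```python
# Assumption: a tiny stopword set and a pre-tokenized sentence
STOPWORDS = {"the", "and"}
tokens = ["The", "cat", "saw", "the", "dog", "and", "THE", "bird", "!"]

def lowercase(toks):
    return [t.lower() for t in toks]

def remove_stopwords(toks):
    return [t for t in toks if t not in STOPWORDS]

def keep_alphabetic(toks):
    return [t for t in toks if t.isalpha()]

# Order A: lowercase -> remove stopwords -> keep alphabetic
order_a = keep_alphabetic(remove_stopwords(lowercase(tokens)))

# Order B: remove stopwords -> lowercase -> keep alphabetic
order_b = keep_alphabetic(lowercase(remove_stopwords(tokens)))

print("Order A:", order_a)  # ['cat', 'saw', 'dog', 'bird']
print("Order B:", order_b)  # ['the', 'cat', 'saw', 'dog', 'the', 'bird']
print("Same result?", order_a == order_b)  # False
```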

Order A correctly removes all three "the"s, because by the time we compare against the lowercase stopword list, every token is lowercase. Order B leaves "The" and "THE" behind, because they did not match the lowercase "the" in the stoplist at the moment we filtered. Same operations, different order, different — and in this case wrong — result.

Order bugs are silent

Notice that Order B did not crash. It produced a perfectly reasonable-looking list that happened to be wrong. Pipeline ordering bugs almost never raise an error; they quietly corrupt your data and you only notice when your downstream results look off. Reasoning carefully about order is a real part of the craft.

Your turn: assemble a mini pipeline

Time to build the cleaning half yourself. The challenge below asks you to write a preprocess function that runs the standard cleaning steps in the right order. The hidden tests check the output on a couple of sentences, including one where a stopword changes everything.

Challenge: Build a text-cleaning pipeline (Python 3.13.2)

Write a function preprocess(text) that returns a list of clean tokens by running these steps in this order:

  1. Word-tokenize the text with word_tokenize.
  2. Lowercase every token.
  3. Keep only tokens that are fully alphabetic (use .isalpha() to drop punctuation and numbers).
  4. Remove English stopwords (use the provided stop set).

Return the resulting list. For example, preprocess("The quick brown foxes!") should return ['quick', 'brown', 'foxes'].

The word_tokenize function and the stop set are already available from the setup section.

Notice what the second test taught you

preprocess("It is NOT a good day.") returned ['good', 'day']. The word "not" vanished — it is on NLTK's stopword list — and with it, the entire meaning of the sentence flipped from negative to positive. That is a preview of one of the most important cautionary tales in this course, which we will tell in full on the stopwords page. Removing stopwords is not free; sometimes it throws away the very words that matter.
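The effect can be isolated in a few lines. This sketch assumes the sentence has already been tokenized, lowercased, and stripped of punctuation, and it uses a small hand-picked subset of stopwords rather than NLTK's full list; the name keep_negation is hypothetical.

```python
# Assumption: a small subset of English stopwords, including "not"
stop = {"it", "is", "not", "a", "the"}

# "It is NOT a good day." after tokenizing, lowercasing, and dropping punctuation
tokens = ["it", "is", "not", "a", "good", "day"]

# Standard stopword removal: the negation disappears
default = [t for t in tokens if t not in stop]

# One possible alternative: take "not" out of the stopword set first
keep_negation = stop - {"not"}
with_negation = [t for t in tokens if t not in keep_negation]

print(default)        # ['good', 'day']
print(with_negation)  # ['not', 'good', 'day']
```

Whether to protect negations like this depends on the task: it matters for sentiment, and may not matter at all for topic counting.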

Check your understanding

Question (select one)

Why can applying the same preprocessing steps in a different order produce different results?

It cannot — order never affects the outcome of preprocessing

Each step transforms the data the next step sees, so (for example) filtering before lowercasing compares differently-cased tokens against a lowercase stopword list and misses some

Reordering steps changes which version of Python runs

Only the first step has any effect; the rest are ignored

Question (select one)

A colleague says, "I always remove stopwords and stem every text, no matter the project — it's just good hygiene." What is the best response?

They are right; more preprocessing is always better

Preprocessing steps are choices that should be driven by the downstream task; some steps that help one task (like topic counting) actively hurt another (like sentiment analysis)

They are right, because stemming and stopword removal are required by NLTK

They are wrong because preprocessing should never be done at all

Question (select one)

Thinking of the pipeline as "a sequence of transformations on a list of tokens" is useful mainly because:

It proves that NLP requires no programming

Once text is tokenized, each later step is just a filter or map over a list — a pattern you already know — so the difficulty shifts from coding the steps to choosing them

It means the order of operations is irrelevant

It guarantees every pipeline gives the same output

With the whole assembly line in view, we can now study each station properly. We start where the pipeline starts: tokenization — and why breaking text into words and sentences is far subtler than it looks.
