Designing the Right Preprocessing Pipeline
The most important skill in classic NLP is choosing which preprocessing steps to apply. There is no universal pipeline; each step adds or destroys information, and the right choice depends entirely on the downstream task. This page covers a decision framework, a task-by-task table, and how to evaluate choices empirically.
If you remember one page from this course, make it this one. Every previous page taught a technique. This page teaches the judgment to combine those techniques well — and that judgment, far more than knowing the function names, is what separates someone who can run NLTK from someone who can build a text system that works.
The central truth is uncomfortable for beginners who want a recipe: there is no universal "correct" pipeline. Lowercasing, stopword removal, stemming, lemmatization, n-grams — each is a choice, and a step that sharpens one task will quietly sabotage another. Choosing well means understanding what each step does to your data and matching it to what your task needs.
The pipeline is a menu, not a fixed recipe
Picture the steps as a menu of optional transformations. Only tokenization is nearly mandatory; everything after it is a decision.
Each optional step (lowercasing, stopword removal, stemming or lemmatization, n-grams) is a question with no default answer. To answer it, apply a single guiding principle.
The one question to ask at every step
For each optional step, ask: "Does the information this step changes or destroys matter to my task?" Lowercasing destroys capitalization — does your task need it (names, acronyms)? Stopword removal destroys function words — does your task need them (negation, phrases, author style)? If the destroyed information is noise for your task, apply the step. If it is signal, skip it. That single question resolves most pipeline decisions.
A decision framework
Two of the highest-stakes choices can be captured in a small decision tree.
This does not cover every case, but it captures the two decisions people get wrong most often: stripping negation before sentiment, and reaching for stemming when meaning matters.
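The two decisions can be sketched as a small function. This is a reconstruction from the surrounding prose rather than a definitive rule set, and the names `recommend`, `function_words_carry_signal`, and `meaning_matters` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    stopwords: str       # "keep" or "remove"
    normalization: str   # "stem" or "lemmatize"


def recommend(function_words_carry_signal: bool, meaning_matters: bool) -> Recommendation:
    """Hypothetical helper covering only the two high-stakes choices."""
    # Decision 1: if negation or other function words matter (e.g. sentiment),
    # removing stopwords would destroy signal, so keep them.
    stopwords = "keep" if function_words_carry_signal else "remove"
    # Decision 2: if the normalized words must still carry readable meaning,
    # prefer lemmatization; if they are only matched against each other, stem.
    normalization = "lemmatize" if meaning_matters else "stem"
    return Recommendation(stopwords, normalization)


search = recommend(function_words_carry_signal=False, meaning_matters=False)
topics = recommend(function_words_carry_signal=False, meaning_matters=True)
sentiment = recommend(function_words_carry_signal=True, meaning_matters=True)

print(search)
print(topics)
print(sentiment)
```

Treat the output as a starting point only: a task such as sentiment analysis may still call for light normalization or none at all.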
Task by task
The clearest way to internalize "it depends" is to walk through real tasks and see the opposite choices they demand.
Search indexing. You want a query for "running" to match "runs" and "run". Lowercase (case is noise for matching), remove stopwords (they bloat the index and rarely help relevance), and stem (fast, and a stem is only ever compared to other stems). Note that a suffix-stripping stemmer will not map an irregular form such as "ran" to "run"; that requires lemmatization. Readability does not matter, because no human sees the index.
Topic analysis / keyword extraction. You want the content words that say what a document is about. Lowercase, remove stopwords (they drown out content), and lemmatize so the keywords you surface are real words a person can read.
Sentiment analysis. Now the calculus flips. Do not strip stopwords naively — negations ("not", "no", "nor") and contrasts ("but") live there and reverse meaning. Lowercasing is usually fine but ALL-CAPS can be a signal worth keeping. Punctuation like "!" can carry intensity. And bigrams help a lot, because "not good" must survive as a unit. The aggressive cleaning that helps topic analysis is exactly wrong here.
Authorship attribution. The most counterintuitive case. Here the stopwords are the signal — an author's unconscious rate of "the", "of", "that" is a fingerprint — and the content words are noise. So you keep stopwords and often discard content words: the inverse of every other task.
Named-entity recognition. Capitalization is a primary clue that "Apple" is a company rather than a fruit. Do not lowercase, do not strip stopwords (they delimit entities), and tag parts of speech before any normalization.
Here is the same advice as a table. Read down each column and notice that no column gives the same answer for every task.
| Task | Lowercase? | Remove stopwords? | Stem / Lemmatize? | N-grams? |
|---|---|---|---|---|
| Search indexing | yes | yes | stem | sometimes |
| Topic / keywords | yes | yes | lemmatize | sometimes |
| Sentiment analysis | careful | no (keep negation) | light, if any | bigrams help |
| Authorship attribution | no | no (it's the signal) | no | sometimes |
| Named-entity recognition | no | no | no | n/a |
No step gets the same answer for every task. That disagreement is the lesson: preprocessing must be designed for the task, not applied by reflex.
Seeing it: one text, two pipelines
Let us run a single sentence through a "topic" pipeline and a "sentiment" pipeline and watch them produce deliberately different results.
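A minimal sketch of such a comparison follows. The review sentence, the tiny hand-written stopword set, and the regex tokenizer are all assumptions made for illustration; they stand in for a real tokenizer and a full stopword list such as NLTK's.

```python
import re

# Assumption: a small hand-written stopword set, far shorter than a real list.
STOPWORDS = {"the", "was", "not", "but", "is", "a", "and"}


def tokenize(text):
    """Lowercase and keep alphabetic runs (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def topic_pipeline(text):
    """Content words only: drop stopwords, including 'not' and 'but'."""
    return [t for t in tokenize(text) if t not in STOPWORDS]


def sentiment_pipeline(text):
    """Keep every token and add bigrams so 'not good' survives as a unit."""
    tokens = tokenize(text)
    bigrams = list(zip(tokens, tokens[1:]))
    return tokens, bigrams


review = "The plot was not good but the acting was great"  # illustrative input

topic_tokens = topic_pipeline(review)
sentiment_tokens, sentiment_bigrams = sentiment_pipeline(review)

print("topic:    ", topic_tokens)
print("sentiment:", sentiment_tokens)
print("bigrams:  ", sentiment_bigrams)
```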
Look at the difference. The topic pipeline threw away "not" and "but", which is fine, because for "what is this review about?" the answer is "plot, acting". But the sentiment pipeline kept them and added the bigram ("not", "good"), because for "how does the reviewer feel?" those words are everything. Same input, deliberately different preprocessing, because the tasks need different information preserved.
The cardinal sin: copy-pasting someone else's pipeline
The most common real-world NLP mistake is lifting a preprocessing snippet from a blog post — usually "lowercase, remove stopwords, stem" — and applying it to a task it was never meant for. That exact recipe is great for search and disastrous for sentiment. Always re-derive the pipeline from your task, even if the code looks boilerplate.
How to actually decide: evaluate, do not guess
The framework above gives you a strong starting point, but the honest answer to "should I remove stopwords here?" is often "try it both ways and measure on your task." This is called an ablation: hold everything else fixed, toggle one step, and compare results on a metric you care about.
The key discipline is to change one step at a time so you can attribute any change in the result to that step. If you flip three options at once and the score moves, you have learned nothing about which one mattered. Disciplined, one-at-a-time evaluation turns pipeline design from superstition into engineering.
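As a rough illustration of a one-step ablation, the sketch below toggles only stopword removal and compares accuracy. Everything in it is a toy assumption: the four labeled sentences, the word lists, and the hypothetical `prepare` and `predict` helpers are invented for the example, so the numbers say nothing about real data.

```python
import re

# Toy assumptions: tiny stopword set, tiny sentiment lexicon, four examples.
STOPWORDS = {"the", "was", "is", "not", "but", "a", "this", "it"}
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "boring"}

DATA = [
    ("The plot was great", "pos"),
    ("This is not good", "neg"),
    ("It was boring", "neg"),
    ("The acting was not bad", "pos"),
]


def prepare(text, remove_stopwords):
    """The only step being ablated is the stopword filter."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def predict(tokens):
    """Lexicon score; a word directly preceded by 'not' has its polarity flipped."""
    score = 0
    for prev, tok in zip([""] + tokens, tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        if prev == "not":
            polarity = -polarity
        score += polarity
    return "pos" if score > 0 else "neg"


def accuracy(remove_stopwords):
    """Same data, same classifier; only the toggled step differs."""
    correct = sum(
        predict(prepare(text, remove_stopwords)) == label for text, label in DATA
    )
    return correct / len(DATA)


results = {flag: accuracy(flag) for flag in (False, True)}
for flag, acc in results.items():
    print(f"remove_stopwords={flag}: accuracy={acc:.2f}")
```

On this toy data, keeping stopwords scores 1.0 and removing them scores 0.5, but that only reflects how the examples were constructed. With a real dataset you would hold out a test set and use the metric your task actually cares about.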
Defaults are a starting point, not an answer
Reasonable defaults for an unknown task: tokenize, lowercase, and keep stopwords until you have a reason to drop them (dropping is the riskier, information-destroying move). Then evaluate each additional step against your metric. Starting conservative and adding aggression only when it earns its keep is safer than starting aggressive and hoping.
What is the single most useful question to ask when deciding whether to apply a preprocessing step?
"Is this step in the most popular tutorial?"
"Does the information this step adds or destroys matter to my specific task?"
"Is this step the fastest to run?"
"Does this step reduce the number of tokens?"
Your turn: a configurable, task-aware step
The cleanest way to respect "it depends" in code is to make a step configurable and choose the setting per task. Implement a preprocessing function whose stopword removal can be turned off — so the same function serves both a topic pipeline (remove stopwords) and a sentiment pipeline (keep them).
Write preprocess(text, remove_stopwords) that:
- Word-tokenizes text.
- Lowercases each token and keeps only alphabetic tokens.
- If remove_stopwords is True, also removes English stopwords; if it is False, leaves all words in.
This single switch lets the same function serve a topic pipeline (remove_stopwords=True) and a sentiment pipeline (remove_stopwords=False, so negations survive).
Examples:
- preprocess("This is NOT good", True) -> ["good"]
- preprocess("This is NOT good", False) -> ["this", "is", "not", "good"]
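One possible solution sketch is shown below. To stay runnable without downloading NLTK data, it assumes a small hand-written stopword set and a hypothetical regex helper, `simple_word_tokenize`, in place of NLTK's English stopword list and word tokenizer; swapping the real ones in should not change the structure.

```python
import re

# Assumption: a tiny stand-in for a full English stopword list.
STOPWORDS = {"this", "is", "not", "the", "a", "an", "and", "of", "to"}


def simple_word_tokenize(text):
    """Hypothetical stand-in tokenizer: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def preprocess(text, remove_stopwords):
    # Tokenize, then lowercase every token.
    tokens = [t.lower() for t in simple_word_tokenize(text)]
    # Keep only purely alphabetic tokens (drops punctuation and numbers).
    tokens = [t for t in tokens if t.isalpha()]
    # The task-dependent switch: filter stopwords only when asked to.
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


print(preprocess("This is NOT good", True))
print(preprocess("This is NOT good", False))
```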
Check your understanding
Why is the "standard" recipe lowercase → remove stopwords → stem a poor default for sentiment analysis?
It is too slow for product reviews
Stopword removal deletes negations like "not", and stemming/over-cleaning erodes the phrase-level cues sentiment depends on, so a negative review can look positive
Stemming is not available in NLTK
Lowercasing is never allowed
For authorship attribution (deciding who wrote an anonymous text), how should you treat stopwords, and why?
Remove them, because they never carry useful information
Keep them — an author's habitual use of function words ("the", "of", "that") is a distinctive fingerprint, so here the stopwords are the signal and content words are closer to noise
Replace each stopword with a random word
Lowercase them but keep everything else uppercase
You want to know empirically whether removing stopwords helps your classifier. What is the disciplined way to find out?
Flip stopword removal, lemmatization, and n-grams all at once and see if the score changes
Run an ablation: hold everything else fixed, toggle only stopword removal, and compare your task metric with vs. without — so any change is attributable to that one step
Ask which option is most popular online and use that
Always remove stopwords; measuring is unnecessary
A reasonable conservative default for a brand-new task you do not yet understand is to:
Apply every preprocessing step as aggressively as possible
Tokenize, lowercase, and keep stopwords (avoid the information-destroying steps) until evaluation shows a step earns its place
Skip tokenization to save time
Remove all words and keep only punctuation
Which statement is the truest summary of preprocessing design?
There is one correct pipeline that works for every NLP task
The right pipeline depends on the downstream task; each step trades information away or adds it, and good design matches those trades to what the task needs
Preprocessing never affects results, so any pipeline is fine
More preprocessing always produces better results
You now know how to choose your steps. The last two pages turn clean tokens into something a model can consume — first by converting text into numeric features (bag-of-words), then by assembling everything into a working rule-based sentiment classifier.