Identifying and Removing Stopwords

Some words are everywhere. In almost any English text, "the", "is", "of", "and", and "to" are among the most frequent words — and they are also among the least informative about what the text is about. These ultra-common, low-content words are called stopwords, and deciding whether to remove them is one of the most consequential — and most frequently botched — choices in a text pipeline.

This page has a split personality on purpose. The first half shows you why removing stopwords is useful. The second half shows you a task where removing them is a disaster. Holding both in your head is the whole point.

What are stopwords, and why remove them?

A stopword is a word so common and grammatically functional that it carries little meaning on its own. NLTK ships a ready-made English list. Let us look at it.
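
A minimal sketch of loading and inspecting the list. It assumes NLTK is installed and that the stopwords corpus has already been fetched once with nltk.download("stopwords"). The helper name load_english_stopwords and the small fallback excerpt are illustrative additions, included only so the snippet still runs when the corpus is missing.

```python
# Small excerpt used only if NLTK or its stopwords corpus is unavailable.
FALLBACK_EXCERPT = ["i", "me", "my", "the", "and", "but", "of", "to", "in",
                    "over", "is", "was", "be", "no", "nor", "not", "very"]


def load_english_stopwords():
    """Return NLTK's English stopword list, or a small excerpt as a fallback."""
    try:
        from nltk.corpus import stopwords
        return stopwords.words("english")
    except (ImportError, LookupError):
        # LookupError means the corpus is not installed locally;
        # running nltk.download("stopwords") once fetches it.
        return list(FALLBACK_EXCERPT)


words = load_english_stopwords()
print(len(words))     # how many stopwords the list contains
print(words[:10])     # the first few entries
```

The exact length depends on the NLTK version you have installed, so treat any specific count as approximate.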

The list is short (under 200 words), but those words account for a large share of the tokens in typical English text. Removing them has two appealing effects: it shrinks your data, and it concentrates attention on the content words (nouns, verbs, adjectives) that actually signal the topic.

Here is the filter in action. The pattern is always the same: build a set of stopwords for fast lookup, then keep only the tokens that are not in it.

For a sentence like "The quick brown fox jumps over the lazy dog in the park", the content words ("quick", "brown", "fox", "jumps", "lazy", "dog", "park") tell you what the sentence is about. The stopwords ("the", "over", "in") mostly just hold the grammar together. For a task like topic detection or building a search index, dropping them is a clear win.

Why a set, not a list?

The check token in stop runs once per token, potentially millions of times. Membership testing in a Python set takes roughly constant time on average, while a list is scanned element by element until a match is found or the end is reached. Converting the stopword list to a set once, up front, can make the filtering step much faster on large texts. It is a small habit that scales well.
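
One way to see the difference for yourself is a rough timing comparison. This is a sketch, not a benchmark: the 179 placeholder strings merely stand in for a stopword list of realistic size, and the absolute numbers will vary by machine.

```python
import timeit

# Placeholder strings standing in for a stopword list of realistic size.
stop_list = [f"word{i}" for i in range(179)]
stop_set = set(stop_list)

# Two misses and one hit at the very end of the list:
# each lookup forces the list to scan all of its elements.
tokens = ["fox", "word178", "park"] * 2_000


def filter_with(container):
    """Drop every token that is found in the given container."""
    return [t for t in tokens if t not in container]


list_time = timeit.timeit(lambda: filter_with(stop_list), number=3)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=3)
print(f"list: {list_time:.4f}s   set: {set_time:.4f}s")
```

Both containers produce the same filtered output; only the time spent on lookups differs.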

The negation trap: when removing stopwords is a disaster

Now the other personality. Look closely at what is in the stopword list.

The negation words ("not", "no", "nor") are stopwords. So are many negated contractions, such as "don't", "isn't", and "wouldn't". These are the words that reverse meaning, and the standard stoplist deletes them. Watch what that does to a negative movie review.
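
A small sketch of the trap. The stop set below is an excerpt of words that appear on NLTK's English list, chosen for illustration; the review text is an invented example.

```python
# Illustrative excerpt. With NLTK installed, use:
#   stop = set(stopwords.words("english"))
stop = {"the", "this", "is", "was", "a", "not", "no", "nor", "but", "very"}

# The negation words are on the list.
print([w for w in ("not", "no", "nor") if w in stop])
# ['not', 'no', 'nor']

review = "the movie was not good"
filtered = [t for t in review.split() if t not in stop]
print(filtered)
# ['movie', 'good']
```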

The review "not good" became simply "good". A sentiment system reading the filtered tokens would conclude the reviewer liked the movie — the exact opposite of the truth. This is not a contrived edge case; negation is everywhere in opinionated text, and blindly removing stopwords before sentiment analysis is a classic, costly mistake.

Do not remove stopwords before sentiment analysis (at least not naively)

The standard stopword list contains the very words that carry sentiment structure: negations ("not", "no", "nor"), contrasts ("but"), and intensity ("very" — also a stopword). Strip them and you can flip a review's meaning. For sentiment, either keep stopwords, or use a custom list that preserves negation and contrast words. The same caution applies to any task where small function words change meaning.

More tasks where stopwords are precious

Sentiment is not the only place the default list hurts. Stopwords are essential — sometimes they are the entire signal — in several tasks:

Phrases and idioms. "To be or not to be" consists entirely of stopwords. Remove them and one of the most famous lines in English vanishes into nothing.
Machine translation. Function words encode tense, number, and relationships. You cannot translate by throwing them away.
Question answering. "Who" vs "whom", "can" vs "cannot" — the small words often are the question.
Authorship attribution. Remarkably, how an author uses stopwords (their unconscious rate of "the", "of", "that") is a fingerprint used to identify who wrote a text. Here the stopwords are the data and the content words are the noise — the exact inverse of topic analysis.
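
Two of these cases are easy to check in a few lines. The sketch below first filters the Hamlet phrase, then computes a toy version of a function-word profile. The helper function_word_rates and the sample sentence are hypothetical illustrations; real attribution systems use far more words and far more text.

```python
from collections import Counter

# Illustrative excerpt of words that are on NLTK's English list.
stop = {"to", "be", "or", "not", "the", "of", "that"}

phrase = "to be or not to be".split()
print([t for t in phrase if t not in stop])
# []


def function_word_rates(tokens, words=("the", "of", "that")):
    """Occurrences of each function word per 1,000 tokens."""
    if not tokens:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000 * counts[w] / len(tokens) for w in words}


sample = "the cat sat on the mat that the dog had left".split()
rates = function_word_rates(sample)
print(rates)
```

For attribution, profiles like rates would be computed per author and compared, which is why the stopwords must stay in the text.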

Two more misconceptions to retire

"Always remove stopwords." No. It is a task-dependent choice, and for a large class of tasks it is harmful.

"The stopword list is universal." Also no. There is no single official list: NLTK's differs from spaCy's and from scikit-learn's, and the right list for your domain often needs customizing. In a corpus of movie reviews, the word "movie" is so ubiquitous that it acts like a stopword; in legal text, "herein" and "pursuant" might do the same. Good practitioners curate the list for the job.

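One simple heuristic for spotting domain-specific stopword candidates is to look for words that occur in nearly every document. The sketch below is an assumption-laden illustration, not a method described above: the function name, the 80% threshold, and the four toy reviews are all invented for the example.

```python
from collections import Counter


def domain_stopword_candidates(docs, min_doc_fraction=0.8):
    """Words that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word at most once per document.
        doc_freq.update(set(doc.lower().split()))
    cutoff = min_doc_fraction * len(docs)
    return sorted(w for w, n in doc_freq.items() if n >= cutoff)


reviews = [
    "a gripping movie with a great cast",
    "this movie drags in the middle",
    "the movie is short and sweet",
    "worst movie of the year",
]
candidates = domain_stopword_candidates(reviews)
print(candidates)
# ['movie']
```

Treat the output as a list of candidates to review by hand, since a word can be frequent and still carry meaning for your task.
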
Customizing the stoplist

Because the list is just a Python collection, you can add domain-specific noise words and — crucially — remove words you must keep.
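
A minimal sketch of the customization. The base set is an illustrative excerpt (with NLTK you would start from set(stopwords.words("english"))), and the keep and extra sets are example choices for a movie-review corpus, not a recommended list.

```python
# Illustrative excerpt. With NLTK installed, use:
#   base = set(stopwords.words("english"))
base = {"the", "this", "is", "a", "of", "not", "no", "nor", "but"}
keep = {"not", "no", "nor", "never", "but"}   # words to protect
extra = {"movie", "film"}                     # domain-specific noise

custom = set(base)   # 1. start from the base list
custom -= keep       # 2. subtract the words you must protect
custom |= extra      # 3. add the words specific to your domain

tokens = "this movie is not a masterpiece but the cast is good".split()
print([t for t in tokens if t not in custom])
# ['not', 'masterpiece', 'but', 'cast', 'good']
```

Subtracting a word that is not in the base set, such as "never" here, is harmless: the set simply stays as it was for that word.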

This three-line recipe — start from the base list, subtract the words you must protect, add the words specific to your domain — is how stopword removal is done responsibly in practice.

Question (select one)

A sentiment analysis pipeline removes NLTK's default English stopwords before scoring reviews. Reviews like "this is not good" keep getting classified as positive. What is the root cause?

The reviews are too short to classify

"not" is on the stopword list, so removing stopwords deletes the negation and "not good" collapses to "good", flipping the sentiment

Sentiment analysis is impossible in Python

The pipeline forgot to lowercase the text

Your turn: filter stopwords without losing negation

Write a function clean_keep_negation(tokens) that removes English stopwords from a list of already-lowercased tokens, but preserves these negation/contrast words so sentiment survives: not, no, nor, never, but.

Build a custom stop set by starting from stopwords.words("english") and subtracting those five words, then keep only tokens not in that custom set.

For example, given ["this", "is", "not", "good"], the function should return ["not", "good"] — ordinary stopwords gone, the negation kept.

Check your understanding

Question (select one)

What best describes a stopword?

A word that signals the end of a sentence

A very common, low-content word (like "the", "is", "of") that appears across nearly all texts and carries little topical meaning on its own

A misspelled word that should be corrected

A word that must always be removed from any text

Question (select one)

For which task is removing stopwords most clearly beneficial?

Building a search index where you want to match on meaningful content words

Analyzing the sentiment of opinionated product reviews

Translating a sentence from English to French

Identifying which author wrote an anonymous text

Question (select one)

Why is "the stopword list is universal and official" a misconception?

Because stopword lists change every day

Because there is no single canonical list — different libraries ship different lists, and the right list depends on your domain (a movie corpus might treat "movie" as a stopword)

Because stopwords do not exist in English

Because only nouns can be stopwords

Question (select one)

You must remove ordinary stopwords from movie reviews but protect negation so sentiment survives. Which approach is correct?

Remove every stopword, then add "not" back to random positions

Start from the base stoplist, subtract the negation/contrast words you must keep, and filter tokens against that customized set

Skip tokenization so stopwords never appear

Convert the reviews to uppercase first

Question (select one)

Converting the stopword list to a Python set before filtering mainly improves:

The accuracy of the filtering

The speed, because membership tests (token in stop) are roughly constant-time in a set versus a linear scan in a list

The number of stopwords in the list

The language of the text

Stopword removal trims the common words. The next page tackles a different kind of redundancy: the fact that "run", "runs", "running", and "ran" are all the same underlying word wearing different endings. That is the job of stemming and lemmatization.
