Dataslope

Stemming vs. Lemmatization

Two ways to collapse 'run', 'runs', 'running', and 'ran' into one root. Stemming chops suffixes with fast rules and may produce non-words; lemmatization looks up real dictionary forms but works best when given the word's part of speech. How to choose between them, and why lemmatization wins when meaning matters.

Tokenization and normalization handle spelling variation. But there is a deeper kind of redundancy in language: the same word appears in many grammatical forms. "run", "runs", "running", and "ran" are four surface forms of one underlying word. "study", "studies", "studied", and "studying" are four forms of another. If you are counting words or matching a search query, you usually want all forms of a word to collapse into one.

There are two classic techniques for that collapse, and the difference between them — in mechanism, in output, and in when to use them — is one of the most clarifying things you can learn in NLP.

The shared goal: reduce words to a root

Both techniques map many forms to one. The disagreement is entirely about how, and that "how" has big consequences.

Stemming is crude and mechanical: it strips suffixes according to fixed rules. It is fast, needs no dictionary, and frequently produces fragments that are not real words ("studi", "happi"). Lemmatization is principled: it looks the word up in a vocabulary and returns its lemma — the proper dictionary headword ("study", "happy"). It is slower, needs a dictionary, and works best when you tell it the word's part of speech.
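
The contrast in mechanism can be sketched in a few lines of plain Python. This is a toy illustration only: the suffix list, the lookup table, and the names toy_stem and toy_lemmatize are invented for this example and are far smaller than Porter's rules or a real vocabulary.

```python
# Toy illustration: rules versus lookup. Neither function is a real
# stemmer or lemmatizer; the suffixes and table entries are made up.

SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix. No dictionary is consulted."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


# (word, part of speech) -> dictionary headword
LEMMA_TABLE = {
    ("studies", "v"): "study",
    ("ran", "v"): "run",
    ("was", "v"): "be",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the (word, pos) pair; unknown pairs come back unchanged."""
    return LEMMA_TABLE.get((word, pos), word)


for w in ["studies", "ran", "was"]:
    print(f"{w:>8}  stem={toy_stem(w):<6} lemma={toy_lemmatize(w, 'v')}")
```

The rule-based function can only shorten a word, so it yields fragments such as "studi" and "wa" and cannot touch "ran". The lookup can return any headword, but only for entries it knows and only when the part of speech matches.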

Stemming: chopping with the Porter algorithm

The most famous stemmer is the Porter stemmer, a set of suffix-stripping rules from 1980 that is still everywhere. Watch it work — and watch it misbehave.
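
A minimal sketch of such a run, assuming NLTK is installed (PorterStemmer needs no extra data downloads). The word list is chosen here for illustration.

```python
# Assumes: pip install nltk. No corpus download is needed for stemming.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = [
    "run", "runs", "running", "ran",
    "studies", "studying",
    "happily", "happiness",
    "organization", "organize", "organ",
]

# Map each surface form to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>12} -> {stem}")
```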

Read that output carefully — it teaches stemming's whole personality:

  • It works on regular forms. "running" and "runs" both stem to "run". "studies" and "studying" both stem to "studi".
  • It misses irregulars. "ran" stays "ran" — the rules only chop suffixes, they do not know that "ran" is the past tense of "run".
  • Its output is often not a real word. "studies" → "studi", "happily" → "happili", "happiness" → "happi". These stems are fine as internal keys but useless if you need to show them to a human or look them up.
  • It can over-stem. Notice that "organization", "organize", and "organ" all reduce to the same stem, "organ", conflating words with quite different meanings. Crushing distinct words together is called over-stemming.
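
One way to spot possible over-stemming is to group a vocabulary by stem and inspect the groups. The sketch below assumes NLTK is installed; the word list is illustrative.

```python
# Group words by their Porter stem to see which forms get merged.
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

groups = defaultdict(list)  # stem -> words that reduce to it
for word in ["organization", "organize", "organ", "run", "running"]:
    groups[stemmer.stem(word)].append(word)

# A group with several members is either a helpful merge
# ("run"/"running") or over-stemming ("organ"/"organization").
# The stemmer cannot tell the difference; a person has to judge.
for stem, members in groups.items():
    print(stem, members)
```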

A stem is a key, not a word

The right mental model: a stem is an opaque identifier that groups related forms, not a meaningful word. "studi" is a perfectly good grouping key — every form of "study" maps to it consistently — but it is not English. As long as you only ever compare stems to other stems (does the query stem match the document stem?), the fact that they are not real words does not matter.
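
The "stems as keys" idea can be sketched as a tiny inverted index. This assumes NLTK is installed; the documents, their ids, and the helper names stem_keys and search are hypothetical, and splitting on whitespace stands in for real tokenization.

```python
# Stems used purely as grouping keys: they are compared, never displayed.
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_keys(text: str) -> set[str]:
    """Lowercase, split on whitespace, and stem each token."""
    return {stemmer.stem(token) for token in text.lower().split()}


documents = {  # hypothetical document ids and texts
    "doc1": "she runs every morning",
    "doc2": "he ran a marathon",
    "doc3": "studies of running injuries",
}

index = defaultdict(set)  # stem -> ids of documents containing it
for doc_id, text in documents.items():
    for key in stem_keys(text):
        index[key].add(doc_id)


def search(query: str) -> set[str]:
    """Return ids of documents sharing at least one stem with the query."""
    hits = set()
    for key in stem_keys(query):
        hits |= index.get(key, set())
    return hits


print(sorted(search("running")))
```

The query "running" finds the documents containing "runs" and "running" because all three share the key "run". The document containing only "ran" is not found, since the stemmer leaves that irregular form alone.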

Lemmatization: looking up the dictionary form

Lemmatization returns the lemma — the canonical dictionary form. NLTK's WordNetLemmatizer uses the WordNet lexical database to do it. The result is always a real word, but there is a catch you must understand.

This is the most important — and most surprising — fact about WordNetLemmatizer: by default it assumes every word is a noun. Because "running" can be a noun ("the running of the race"), lemmatize("running") returns "running" unchanged. Only when you pass pos="v" does it know to treat it as a verb and return "run".

With the right part of speech, look how good it is: "ran" → "run" and "was" → "be" with pos="v", and "better" → "good" with pos="a". A stemmer could never produce "be" from "was" or "good" from "better", because those are not suffix operations; they require knowing the word.

The number-one lemmatization bug

WordNetLemmatizer.lemmatize(word) with no pos argument treats word as a noun, so most verbs and adjectives come back unchanged and beginners conclude "lemmatization doesn't do anything." It does — you just have to tell it the part of speech. This is exactly why lemmatization and part-of-speech tagging (the next page but one) are natural partners: tag first to learn each word's POS, then lemmatize with that POS for accurate results.
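
The glue between a tagger and the lemmatizer is usually a small mapping from tagger tags to the pos letters the lemmatizer accepts. The sketch below assumes Penn Treebank style tags (the kind nltk.pos_tag produces); penn_to_wordnet is a hypothetical helper name, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet pos letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else


# Hand-written (word, tag) pairs standing in for a tagger's output.
tagged = [("she", "PRP"), ("was", "VBD"), ("running", "VBG"), ("better", "JJR")]

with_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(with_pos)
```

Each resulting letter can then be passed as the pos argument of lemmatize. Falling back to "n" mirrors the lemmatizer's own default, so untagged or unusual tokens behave as they would without this step.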

Stemming vs. lemmatization, side by side

Let us run both on the same words so the trade-off is unmistakable.

Notice that "happily" stems to the non-word "happili", whereas a lemmatizer at least returns a real word. "was" stems to "wa" (nonsense) but lemmatizes to "be" when treated as a verb. "mice" and "feet" defeat the stemmer entirely, since suffix rules cannot turn "mice" into "mouse", while a lemmatizer with a dictionary can. Stemming is fast and approximate; lemmatization is slower but more accurate, provided it is given the right part of speech.

Question (select one)

Why does WordNetLemmatizer().lemmatize("running") return "running" unchanged, while lemmatize("running", pos="v") returns "run"?

Because "running" has no lemma

The lemmatizer defaults to treating the word as a noun, and "running" is a valid noun, so it is left as-is; supplying pos="v" tells it to treat the word as a verb, whose lemma is "run"

It is a bug in NLTK

"running" and "run" are unrelated words

When to use which

Reach for stemming when you are processing huge volumes and you only ever compare stems to stems: classically, a search engine. If a user searches "running shoes", stemming both the query and the documents maps "run", "runs", and "running" to the same key, "run", so all three match. (The irregular "ran" is not caught, because the stemmer leaves it as "ran".) Nobody ever sees the stem rendered on screen, so it does not matter that some stems are not words.

Reach for lemmatization when the root must be a real word or meaning matters: building features for careful analysis, normalizing text a human will read, or any semantic application where "good" and "better" should unify and "be" should capture "was/is/were/been". The cost is speed and the need for a dictionary (and, ideally, POS tags), but the output is principled.

The one-line summary

Stemming is fast, dictionary-free suffix-chopping that may produce non-words — great for search and recall at scale. Lemmatization is slower, dictionary-backed, POS-aware reduction to real words — better whenever meaning or readability matters. When in doubt for a semantic task, prefer lemmatization.

Common misconceptions

  • "They are the same thing." No. Different mechanism (rules vs. lookup), different output (fragments vs. real words), different cost.
  • "Stems are real words." Often not — "studi", "happi", "wa". A stem is a grouping key, not a vocabulary word.
  • "Lemmatization always changes the word." No — with the default noun POS, many words come back unchanged. You frequently need POS for it to do its job.
  • "More aggressive reduction is always better." No — over-stemming merges unrelated words ("organ" and "organization"), which can hurt as much as it helps.

Your turn: lemmatize verbs correctly

Challenge: Lemmatize a list of verbs

Write a function verb_lemmas(words) that lemmatizes each word as a verb and returns the list of lemmas. Use the provided lemmatizer and pass pos="v" so verbs are reduced correctly.

For example, verb_lemmas(["running", "ran", "studies"]) should return ["run", "run", "study"]. Notice the irregular "ran" correctly becomes "run" — something a stemmer cannot do.

Check your understanding

Question (select one)

Which statement correctly contrasts stemming and lemmatization?

Stemming uses a dictionary; lemmatization uses suffix rules

Stemming chops suffixes with fixed rules and may output non-words; lemmatization looks words up in a vocabulary and returns real dictionary forms (and benefits from knowing the part of speech)

They are identical and interchangeable

Lemmatization is always faster than stemming

Question (select one)

A search engine team wants queries like "running shoes" to also match pages containing "run" and "runs". Speed at massive scale matters; the reduced form is never shown to users. Which technique fits best?

Stemming — fast, dictionary-free, and fine that the stem may not be a real word, since it is only ever compared to other stems

Lemmatization, because real words are required for matching

Neither; search cannot use root reduction

Both must always be applied together

Question (select one)

Why can a lemmatizer turn "was" into "be" while a stemmer cannot?

The stemmer is broken

"was" → "be" is an irregular mapping that requires knowing the word (a dictionary lookup), not stripping a suffix; stemmers only remove endings

Stemmers only work on nouns

"was" and "be" are unrelated

Question (select one)

What is over-stemming?

Running the stemmer more than once

When the stemmer reduces distinct words (like "organ" and "organization") to the same stem, wrongly merging different meanings

When a stem is longer than the original word

When lemmatization fails to change a word

You now know how to collapse word forms to a root. Before we can lemmatize accurately, though, we need to know each word's part of speech — and that is coming up. But first, let us put your clean tokens to work and actually measure a text: how many distinct words, which are most frequent, and how varied the vocabulary is.
