Part-of-Speech (POS) Tagging
Labeling each word with its grammatical role — noun, verb, adjective — using NLTK's pos_tag. Why context decides the tag, the Penn Treebank tagset, building syntactic patterns, and using POS tags to lemmatize accurately.
So far every step has treated words as interchangeable atoms: a token is a token, counted and filtered without regard to its role in the sentence. But the same word can play different roles. In "I caught a fish", "fish" is a noun. In "We fish every weekend", "fish" is a verb. To a frequency counter these are identical; grammatically they could not be more different.
Part-of-speech (POS) tagging assigns each word its grammatical category — noun, verb, adjective, adverb, and so on. It is the first step that pays attention to syntax rather than just the word's spelling, and it unlocks a lot: better lemmatization, extracting "all the nouns", recognizing names, and finding grammatical patterns.
The problem POS tagging solves: words wear many hats
A word's category depends on context, not just the word itself. This is why POS tagging cannot be a simple dictionary lookup — the same spelling needs different tags in different sentences.
A tagger must read the surrounding words to decide. NLTK's default tagger, the averaged perceptron tagger, does exactly that: it was trained on a large hand-tagged corpus and predicts each word's tag from features of the word and its neighbors. Let us watch it disambiguate "fish".
Same word, two sentences, two different tags: NN (singular noun) in the first,
VBP (present-tense verb, not third-person singular) in the second. The tagger
got there from context, which is why POS tagging is more than a lookup table.
Reading the tags: the Penn Treebank tagset
pos_tag returns a list of (word, tag) tuples. The tags come from the
Penn Treebank tagset, which is more fine-grained than "noun/verb" — it
distinguishes singular from plural nouns, verb tenses, and so on. You do not
need to memorize all ~36 tags, but you should recognize the common families.
| Tag | Meaning | Example |
|---|---|---|
| NN / NNS | noun, singular / plural | dog / dogs |
| NNP / NNPS | proper noun, singular / plural | Alice / Americas |
| VB / VBD / VBG / VBZ | verb: base / past / gerund / 3rd-person | run / ran / running / runs |
| JJ / JJR / JJS | adjective / comparative / superlative | big / bigger / biggest |
| RB | adverb | quickly |
| DT | determiner | the, a |
| IN | preposition / subordinating conjunction | in, of |
| PRP | personal pronoun | I, you, it |
| CC | coordinating conjunction | and, but |
The first letter is the shortcut
You can decode most tags from their first letter: N… is a noun, V… is a
verb, J… is an adjective, R… is an adverb. This is so useful that we will
use it in code in a moment to convert Penn tags into the simpler categories a
lemmatizer wants. When you only care about coarse categories, tag[0] or
tag.startswith("NN") is often all you need.
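The first-letter shortcut can be sketched in plain Python. The helper name `coarse_category` and its category labels are illustrative choices, not part of NLTK.

```python
def coarse_category(penn_tag):
    """Map a Penn Treebank tag to a coarse category using its first letter."""
    first_letter_to_category = {
        "N": "noun",
        "V": "verb",
        "J": "adjective",
        "R": "adverb",
    }
    # Anything else (determiners, prepositions, punctuation, ...) is "other".
    # The rule is an approximation: RP (particle) also starts with R, for example.
    return first_letter_to_category.get(penn_tag[:1], "other")


for tag in ["NNS", "VBD", "JJR", "RB", "DT"]:
    print(tag, "->", coarse_category(tag))
```

Slicing with `penn_tag[:1]` rather than indexing with `penn_tag[0]` keeps the helper from raising on an empty string.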
Here is a full sentence tagged, laid out as a tree of word-to-tag mappings.
A real payoff: extracting syntactic patterns
Once words carry tags, you can pull out grammatical patterns cheaply. Want
every noun in a document (to guess what it is about)? Keep tokens whose tag
starts with NN. Want adjectives (useful for opinion mining)? Keep JJ. This
"filter by tag" move is the backbone of simple information extraction.
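The filter itself needs nothing beyond a list comprehension. The sketch below runs on hand-written (word, tag) pairs in the shape pos_tag returns; the tags are illustrative rather than real tagger output, and `words_with_tag_prefix` is a hypothetical helper name.

```python
# Hand-written (word, tag) pairs, not actual tagger output.
tagged = [
    ("The", "DT"), ("old", "JJ"), ("bridge", "NN"), ("crosses", "VBZ"),
    ("a", "DT"), ("wide", "JJ"), ("river", "NN"), ("near", "IN"),
    ("two", "CD"), ("small", "JJ"), ("villages", "NNS"), (".", "."),
]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, "NN")       # ['bridge', 'river', 'villages']
adjectives = words_with_tag_prefix(tagged, "JJ")  # ['old', 'wide', 'small']
print(nouns)
print(adjectives)
```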
POS tagging makes lemmatization accurate
Remember the big lemmatization gotcha: WordNetLemmatizer treats every word
as a noun unless told otherwise, so inflected verbs such as "running" come
back unchanged. POS tagging is the missing piece. Tag first, convert each Penn
tag to the coarse category WordNet understands (noun, verb, adjective, or
adverb), then lemmatize with that category. This is the standard, accurate
lemmatization recipe.
Compare the two lines. Without POS, "running" and "eaten" survive unchanged. With POS, "running" → "run" and "eaten" → "eat". The tagger supplied the context the lemmatizer needed. This tag-then-lemmatize pattern is worth committing to memory — it is how accurate normalization is done in practice.
Tagging is contextual, and not perfect
Because the tagger predicts from context, it can be wrong — especially on short, ungrammatical, or unusual text (headlines, tweets, product titles). It is right the large majority of the time on well-formed English, but treat its output as a strong guess, not gospel. Lowercasing or stripping punctuation before tagging tends to hurt accuracy, since the tagger uses capitalization and punctuation as clues — another reason to tag relatively early.
Why can't part-of-speech tagging be done with a simple dictionary that maps each word to one fixed tag?
Dictionaries are too slow to look words up in
A word's part of speech depends on context — "fish" is a noun in "I caught a fish" but a verb in "we fish on weekends" — so the same word needs different tags in different sentences
Every word in English has exactly one possible part of speech
Tagging only works on numbers
Real-world uses of POS tags
- Accurate lemmatization (as above) — the most common pairing.
- Named-entity recognition builds on proper-noun tags (NNP) to find people, places, and organizations.
- Information extraction: pull all noun phrases to summarize "what" a document discusses, or adjective–noun pairs for opinion mining ("great battery", "slow service").
- Grammar and writing tools flag, e.g., a sentence with no verb.
- Search: knowing a query word is a verb vs. a noun can sharpen results.
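The adjective-noun pairing from the list above can be sketched over already-tagged tokens. The function name and the hand-written tags are illustrative assumptions, and the adjacency rule is a deliberate simplification of real noun-phrase extraction.

```python
def adjective_noun_pairs(tagged_tokens):
    """Return (adjective, noun) pairs where a JJ* tag is immediately followed by an NN* tag."""
    pairs = []
    # Walk over each token together with the token that follows it.
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag.startswith("JJ") and next_tag.startswith("NN"):
            pairs.append((word, next_word))
    return pairs


# Hand-written (word, tag) pairs in the shape pos_tag returns.
review = [
    ("Great", "JJ"), ("battery", "NN"), ("but", "CC"),
    ("slow", "JJ"), ("service", "NN"), (".", "."),
]
print(adjective_noun_pairs(review))  # [('Great', 'battery'), ('slow', 'service')]
```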
Your turn: extract the nouns
Write a function get_nouns(text) that returns a list of the words in
text that are tagged as nouns — that is, words whose Penn Treebank tag
starts with "NN" (this covers NN, NNS, NNP, and NNPS, so both
common and proper nouns count).
Steps: word-tokenize the text, run pos_tag on the tokens, then keep the
words whose tag starts with "NN".
For example, get_nouns("The hungry cat chased a small mouse.") should return
["cat", "mouse"].
Check your understanding
What does pos_tag return when given a list of tokens?
A single string naming the sentence's overall grammar
A list of (word, tag) tuples, one per token, where the tag is the word's part of speech in context
A list of only the nouns
The lemmatized form of each word
In the Penn Treebank tagset, which tags would the filter tag.startswith("NN")
match?
Only NN (singular common nouns)
NN, NNS, NNP, and NNPS — singular and plural common nouns and proper nouns
All verbs
Determiners and prepositions
Why does tagging before lemmatizing produce better results than lemmatizing alone?
Tagging makes the text shorter
The tag tells the lemmatizer each word's part of speech, so verbs lemmatize as verbs ("running" → "run") instead of being left unchanged under the default noun assumption
Lemmatizing first would delete all the verbs
Tagging removes stopwords automatically
A teammate runs the tagger on a batch of all-lowercase, punctuation-stripped product titles and gets noticeably worse tags than on clean sentences. What is the most likely reason?
The tagger only works in the morning
The tagger uses capitalization and punctuation as contextual clues; stripping them away beforehand removes signal the model relies on, lowering accuracy
POS tagging cannot be applied to product titles at all
The titles need to be translated first
You can now label words by grammatical role. Notice a limitation though: every step so far treats words individually. But "New York", "machine learning", and "not good" are multi-word units whose meaning lives in the combination. To capture local word order, we turn to n-grams.