N-grams: Bigrams and Trigrams
Single tokens throw away word order, but meaning often lives in word combinations. N-grams capture local context by sliding a window over the tokens. This lesson covers generating bigrams and trigrams with nltk.util.ngrams, counting phrases, and the sparsity trade-off as n grows.
Every analysis step so far has a quiet blind spot: it treats words individually. A bag of words knows the document contains "not" and "good" but has no idea they sat next to each other as "not good". It sees "New" and "York" but not "New York". Word order carries meaning, and to capture even a little of it we use n-grams.
An n-gram is a contiguous sequence of n tokens. A 1-gram (unigram)
is a single word. A 2-gram (bigram) is a pair of adjacent words. A 3-gram
(trigram) is a run of three. By looking at adjacent groups instead of lone
words, n-grams capture local context that single tokens lose.
The sliding window
The mental image is a window of width n that slides along the token list one
step at a time. Each position is one n-gram, and consecutive windows
overlap.
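The window can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a four-token example sentence; the helper name sliding_windows is hypothetical and is not part of NLTK.

```python
def sliding_windows(tokens, n):
    """Return every run of n adjacent tokens as a tuple (hypothetical helper)."""
    # The last valid start index is len(tokens) - n, so there are
    # len(tokens) - n + 1 windows in total (none if n > len(tokens)).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "quick", "brown", "fox"]  # assumed example sentence
bigrams = sliding_windows(tokens, 2)

for position, window in enumerate(bigrams, start=1):
    print(f"window {position}: {window}")
# window 1: ('the', 'quick')
# window 2: ('quick', 'brown')
# window 3: ('brown', 'fox')
```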
Notice the overlap. With four tokens such as "the quick brown fox" and n = 2,
"quick" appears in both window 1 and window 2. That is deliberate: overlapping
windows ensure every adjacent pair is captured. For a list of k tokens with
k >= n there are k - n + 1 n-grams (here, 4 - 2 + 1 = 3 bigrams); a list
shorter than n yields none.
Generating n-grams with nltk.util.ngrams
NLTK gives you ngrams(tokens, n), which yields the windows as tuples. It
returns a generator, so wrap it in list(...) to see or reuse the results.
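As a minimal sketch, assuming a five-token example list (the tokens are illustrative), generating both sizes might look like this:

```python
from nltk.util import ngrams

tokens = ["the", "quick", "brown", "fox", "jumps"]  # assumed 5-token example

# ngrams() is lazy; list() materializes the windows so they can be reused.
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams)
print(trigrams)
print(len(bigrams), len(trigrams))  # 4 3
```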
Each bigram is a 2-tuple of adjacent words; each trigram is a 3-tuple. The
counts match the formula: 5 tokens give 5 - 2 + 1 = 4 bigrams and
5 - 3 + 1 = 3 trigrams.
Several ways to the same n-grams
nltk.util.ngrams(tokens, n) is the general tool — pass any n. NLTK also
offers the convenience shortcuts nltk.bigrams(tokens) and
nltk.trigrams(tokens) for the two most common cases. All three return
generators of tuples, so wrap them in list() to materialize them. We use
ngrams(tokens, n) here because it makes the role of n explicit.
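A short sketch of the equivalence, using an illustrative token list. It also shows why list() matters: the lazy result can be consumed only once.

```python
import nltk
from nltk.util import ngrams

tokens = ["new", "york", "is", "big"]  # illustrative tokens

general = list(ngrams(tokens, 2))
shortcut = list(nltk.bigrams(tokens))
print(general == shortcut)  # True: same windows either way

print(list(nltk.trigrams(tokens)) == list(ngrams(tokens, 3)))  # True

# The lazy result is single-use: after one pass it is empty.
lazy = ngrams(tokens, 2)
first_pass = list(lazy)
second_pass = list(lazy)
print(len(first_pass), len(second_pass))  # 3 0
```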
Why n-grams matter: order is meaning
Two quick demonstrations of what unigrams miss and bigrams catch.
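A minimal sketch of both ideas, using made-up example sentences (the sentences and variable names are illustrative assumptions):

```python
from nltk.util import ngrams

# Demonstration 1: negation. Both sentences contain exactly the same words.
positive = "the food was good not bad".split()
negative = "the food was bad not good".split()

# Identical bags of words: unigram features cannot tell these apart.
print(sorted(positive) == sorted(negative))  # True

pos_bigrams = set(ngrams(positive, 2))
neg_bigrams = set(ngrams(negative, 2))
print(("not", "good") in pos_bigrams)  # False
print(("not", "good") in neg_bigrams)  # True

# Demonstration 2: a two-word name survives as a single feature.
city_bigrams = set(ngrams("i moved to new york".split(), 2))
print(("new", "york") in city_bigrams)  # True
```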
The bigram ("not", "good") is a concrete, countable feature that a sentiment
model can learn is negative — something no single-word feature can express.
This is one of the simplest, most effective upgrades to a bag-of-words model:
add bigrams so that negations and key phrases survive.
Counting n-grams reveals phrases
Counting n-grams (with FreqDist or Counter, just like words) surfaces the
common phrases in a text — the building blocks of autocomplete, phrase
search, and collocation discovery.
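Here is a small sketch of phrase counting with collections.Counter, assuming a toy text written for this illustration. nltk.FreqDist could be used the same way, since it also offers most_common.

```python
from collections import Counter

from nltk.util import ngrams

# Toy text, assumed for illustration only.
text = ("machine learning is fun and machine learning is useful "
        "because machine learning helps")
tokens = text.split()

# Counter consumes the lazy n-gram stream and tallies each tuple.
bigram_counts = Counter(ngrams(tokens, 2))

for phrase, count in bigram_counts.most_common(2):
    print(" ".join(phrase), count)
# machine learning 3
# learning is 2
```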
"machine learning" rises to the top as the most frequent bigram — the counter discovered a meaningful two-word phrase purely from co-occurrence. This is the seed of collocation detection (finding word pairs that go together more than chance would predict) and of next-word prediction: given "machine", the data suggests "learning" is a likely follow-up.
N-grams are the intuition behind autocomplete
When your phone suggests the next word, a classic approach is an n-gram language model: count which word most often follows the previous one or two words, and suggest that. "United" is often followed by "States"; "machine" by "learning". You are not building a full language model here, but you now see its core mechanism — counting n-grams — with your own eyes.
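The counting idea can be sketched as a tiny next-word suggester. This is an illustration only, not a real language model: the function names build_next_word_table and suggest are hypothetical, and the token list is a made-up example.

```python
from collections import Counter, defaultdict


def build_next_word_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    # zip(tokens, tokens[1:]) walks over every adjacent pair (bigram).
    for previous, following in zip(tokens, tokens[1:]):
        table[previous][following] += 1
    return table


def suggest(table, word):
    """Return the most frequent follower of word, or None if unseen."""
    followers = table.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


tokens = ("the united states and the united nations "
          "met in the united states").split()
table = build_next_word_table(tokens)

print(suggest(table, "united"))  # states
print(suggest(table, "zebra"))   # None
```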
The trade-off: bigger n is not better
It is tempting to think "if bigrams help, 5-grams must help more." They
usually do not. As n grows, each specific n-gram becomes rarer, until almost
every one appears just once and counting them tells you nothing. This is the
sparsity problem.
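A small sketch makes the effect visible. It assumes a toy 17-token text and a hypothetical helper, ngram_stats, that reports how many n-grams exist, how many are distinct, and how many occur more than once.

```python
from collections import Counter


def ngram_stats(tokens, n):
    """Return (total, distinct, repeating) n-gram counts for a token list."""
    # Zipping n shifted copies of the list yields each window of n tokens.
    counts = Counter(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    distinct = len(counts)
    repeating = sum(1 for c in counts.values() if c > 1)
    return total, distinct, repeating


# Toy text, assumed for illustration only.
tokens = ("the cat sat on the mat the dog sat on the rug "
          "the cat saw the dog").split()

print("n  total  distinct  that repeat")
for n in range(1, 5):
    total, distinct, repeating = ngram_stats(tokens, n)
    print(f"{n}  {total:5}  {distinct:8}  {repeating:11}")
# "that repeat" comes out as 5, 4, 1, 0 for n = 1, 2, 3, 4
```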
Count the n-grams that repeat and watch that number collapse as n grows. At n=1 and n=2 many
n-grams recur, so counts are meaningful. By n=4 almost every window is
unique — there is no pattern left to count. In practice, bigrams and
trigrams are the sweet spot: enough context to be useful, common enough to
still carry statistical signal. Going higher usually buys sparsity, a
ballooning feature space, and little else.
Two costs of large n
Sparsity: rare n-grams give unreliable counts. Explosion: the number
of possible n-grams grows exponentially with n (a vocabulary of V words
allows up to V^n distinct n-grams), so your feature space and memory blow up
while most entries are zero. Both push you toward small n. Reach past
trigrams only with a specific reason and a lot of data.
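To put numbers on the explosion, here is a quick back-of-the-envelope sketch. The vocabulary size of 10,000 is an assumption chosen for illustration, and possible_ngrams is a hypothetical helper.

```python
def possible_ngrams(vocabulary_size, n):
    """Upper bound on distinct n-grams: each of the n slots can hold any word."""
    return vocabulary_size ** n


vocabulary_size = 10_000  # assumed, for illustration only

for n in range(1, 5):
    print(f"n={n}: up to {possible_ngrams(vocabulary_size, n):,} possible n-grams")
# n=1: up to 10,000 possible n-grams
# n=2: up to 100,000,000 possible n-grams
# n=3: up to 1,000,000,000,000 possible n-grams
# n=4: up to 10,000,000,000,000,000 possible n-grams
```

Only a tiny fraction of these combinations ever appears in a real corpus, which is why most entries in an n-gram feature matrix are zero.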
Why does adding the bigram ("not", "good") help a sentiment model that
single words could not?
Bigrams are faster to compute than single words
A lone "not" and a lone "good" do not capture that they were adjacent; the bigram preserves the local order so the model can learn that "not good" is negative
Bigrams remove stopwords automatically
Bigrams translate the text into another language
Your turn: generate n-grams
Write a function get_ngrams(tokens, n) that returns a list of all
n-grams of tokens, where each n-gram is a tuple of n adjacent tokens. Use
ngrams from nltk.util (already imported) and remember it returns a
generator, so wrap it in list(...).
For example:
get_ngrams(["a", "b", "c", "d"], 2) -> [("a","b"), ("b","c"), ("c","d")]
get_ngrams(["a", "b", "c", "d"], 3) -> [("a","b","c"), ("b","c","d")]
Check your understanding
A trigram is:
A word that appears exactly three times
A contiguous sequence of three adjacent tokens
The three most common words in a text
A sentence with three words
A token list has 12 tokens. How many bigrams does it produce?
12
11
6
24
Why are bigrams and trigrams usually preferred over much larger n-grams (say, 6-grams)?
Larger n-grams are illegal in NLTK
As n grows, specific n-grams become increasingly rare (sparse) and the feature space explodes, so large n-grams carry little statistical signal while costing a lot of memory
Larger n-grams are always slower to type
Bigrams capture the entire meaning of any document
nltk.util.ngrams(tokens, n) returns a generator. What must you do to get a
reusable list of tuples you can index and count?
Nothing; a generator already behaves like a list
Wrap it in list(...), e.g. list(ngrams(tokens, n)), to materialize the n-grams into a list
Convert it to a string with str(...)
Call .sort() on the generator
You have now met every classic building block: tokens, normalized tokens, filtered tokens, roots, counts, tags, and n-grams. The final section puts them to work — starting with the most important skill of all: choosing which of these steps your task actually needs.