N-grams: Bigrams and Trigrams

Every analysis step so far has a quiet blind spot: it treats words individually. A bag of words knows the document contains "not" and "good" but has no idea they sat next to each other as "not good". It sees "New" and "York" but not "New York". Word order carries meaning, and to capture even a little of it we use n-grams.

An n-gram is a contiguous sequence of n tokens. A 1-gram (unigram) is a single word. A 2-gram (bigram) is a pair of adjacent words. A 3-gram (trigram) is a run of three. By looking at adjacent groups instead of lone words, n-grams capture local context that single tokens lose.

The sliding window

The mental image is a window of width n that slides along the token list one step at a time. Each position is one n-gram, and consecutive windows overlap.

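Notice the overlap. With the four tokens "the", "quick", "brown", "fox" and n = 2, for instance, the windows are ("the", "quick"), ("quick", "brown"), and ("brown", "fox"), so "quick" appears in both window 1 and window 2. That is deliberate: overlapping windows ensure every adjacent pair is captured. For a list of k tokens there are k - n + 1 n-grams (here, 4 - 2 + 1 = 3 bigrams).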
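The window can be sketched in a few lines of plain Python. The helper name `sliding_windows` and the four example tokens are illustrative assumptions, not part of NLTK:

```python
def sliding_windows(tokens, n):
    """Return every run of n adjacent tokens, in order, as tuples."""
    # The window starts at index 0 and advances one token at a time;
    # the last valid start index is len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "quick", "brown", "fox"]  # assumed example sentence
windows = sliding_windows(tokens, 2)

for position, window in enumerate(windows, start=1):
    print(f"window {position}: {window}")

# k tokens produce k - n + 1 windows.
assert len(windows) == len(tokens) - 2 + 1
```

If n is larger than the number of tokens, the range is empty and the function returns an empty list.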

Generating n-grams with `nltk.util.ngrams`

NLTK gives you ngrams(tokens, n), which yields the windows as tuples. It returns a generator-style lazy iterator that can be consumed only once, so wrap it in list(...) to see or reuse the results.
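A minimal sketch of the call, assuming NLTK is installed. The five-token sentence is a made-up example:

```python
from nltk.util import ngrams

tokens = ["I", "love", "natural", "language", "processing"]  # assumed example, 5 tokens

# Materialize the lazy results so they can be printed, indexed, and reused.
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams)
print(trigrams)
print(len(bigrams), len(trigrams))

# Without list(...), the results can be walked only once.
lazy = ngrams(tokens, 2)
first_pass = list(lazy)
second_pass = list(lazy)  # already exhausted, so nothing is left
print(len(first_pass), len(second_pass))
```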

Each bigram is a 2-tuple of adjacent words; each trigram is a 3-tuple. The counts match the formula: 5 tokens give 5 - 2 + 1 = 4 bigrams and 5 - 3 + 1 = 3 trigrams.

Several ways to the same n-grams

nltk.util.ngrams(tokens, n) is the general tool — pass any n. NLTK also offers the convenience shortcuts nltk.bigrams(tokens) and nltk.trigrams(tokens) for the two most common cases. All three return generators of tuples, so wrap them in list() to materialize them. We use ngrams(tokens, n) here because it makes the role of n explicit.
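A quick check that the shortcuts and the general function agree, assuming NLTK is installed (the token list is an arbitrary example):

```python
from nltk import bigrams, trigrams
from nltk.util import ngrams

tokens = ["to", "be", "or", "not", "to", "be"]

# Each shortcut produces the same tuples as ngrams with the matching n.
assert list(bigrams(tokens)) == list(ngrams(tokens, 2))
assert list(trigrams(tokens)) == list(ngrams(tokens, 3))

print(list(bigrams(tokens)))
```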

Why n-grams matter: order is meaning

Two quick demonstrations of what unigrams miss and bigrams catch.
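Both can be sketched with plain Python. The example sentences and the helper `adjacent_pairs` are illustrative assumptions; for a list, `zip(tokens, tokens[1:])` produces the same pairs as `ngrams(tokens, 2)`.

```python
from collections import Counter


def adjacent_pairs(tokens):
    """Bigrams of a token list: each token paired with its right neighbour."""
    return list(zip(tokens, tokens[1:]))


# Demonstration 1: same words, opposite meaning.
praise = "the food was good not bad".split()
complaint = "the food was bad not good".split()

print(Counter(praise) == Counter(complaint))         # identical bags of words
print(("not", "good") in adjacent_pairs(praise))     # absent from the praise
print(("not", "good") in adjacent_pairs(complaint))  # present in the complaint

# Demonstration 2: a name that only exists as a pair.
trip = "she moved to New York".split()
print(("New", "York") in adjacent_pairs(trip))
```

The two reviews are indistinguishable as bags of words, yet only one of them contains the pair ("not", "good").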

The bigram ("not", "good") is a concrete, countable feature that a sentiment model can learn is negative — something no single-word feature can express. This is one of the simplest, most effective upgrades to a bag-of-words model: add bigrams so that negations and key phrases survive.

Counting n-grams reveals phrases

Counting n-grams (with FreqDist or Counter, just like words) surfaces the common phrases in a text — the building blocks of autocomplete, phrase search, and collocation discovery.
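A minimal sketch using `Counter` from the standard library together with NLTK's `ngrams`. The short text is invented for illustration, and a plain whitespace split stands in for a real tokenizer:

```python
from collections import Counter

from nltk.util import ngrams

text = ("machine learning is fun and machine learning is useful "
        "because machine learning powers search")
tokens = text.split()  # whitespace split keeps the sketch short

# Counter accepts any iterable of hashable items, and tuples are hashable.
bigram_counts = Counter(ngrams(tokens, 2))

for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```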

"machine learning" rises to the top as the most frequent bigram — the counter discovered a meaningful two-word phrase purely from co-occurrence. This is the seed of collocation detection (finding word pairs that go together more than chance would predict) and of next-word prediction: given "machine", the data suggests "learning" is a likely follow-up.

N-grams are the intuition behind autocomplete

When your phone suggests the next word, a classic approach is an n-gram language model: count which word most often follows the previous one or two words, and suggest that. "United" is often followed by "States"; "machine" by "learning". You are not building a full language model here, but you now see its core mechanism — counting n-grams — with your own eyes.
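The counting idea can be sketched as a toy next-word suggester. This is an illustration under simple assumptions (a tiny invented text, one word of context, hypothetical helper names `build_followers` and `suggest`), not how any particular keyboard works:

```python
from collections import Counter, defaultdict


def build_followers(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    followers = defaultdict(Counter)
    for previous, following in zip(tokens, tokens[1:]):
        followers[previous][following] += 1
    return followers


def suggest(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]


tokens = ("machine learning is fun machine learning is hard "
          "machine translation is hard").split()
followers = build_followers(tokens)

print(suggest(followers, "machine"))
print(suggest(followers, "zebra"))
```

Using the previous two words as context instead of one would mean counting trigrams rather than bigrams.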

The trade-off: bigger n is not better

It is tempting to think "if bigrams help, 5-grams must help more." They usually do not. As n grows, each specific n-gram becomes rarer, until almost every one appears just once and counting them tells you nothing. This is the sparsity problem.
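A minimal sketch of how the effect could be tabulated, assuming NLTK is installed. The toy sentence and the column names are assumptions; a real corpus gives far larger numbers, but the last column typically shrinks in the same way:

```python
from collections import Counter

from nltk.util import ngrams

tokens = "the cat sat on the mat and the dog sat on the rug".split()

rows = []
for n in range(1, 5):
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())      # number of windows
    distinct = len(counts)            # number of different n-grams
    repeating = sum(1 for c in counts.values() if c > 1)  # seen more than once
    rows.append((n, total, distinct, repeating))

print(f"{'n':>2} {'total':>6} {'distinct':>9} {'that repeat':>12}")
for n, total, distinct, repeating in rows:
    print(f"{n:>2} {total:>6} {distinct:>9} {repeating:>12}")
```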

Watch the "that repeat" column collapse as n grows. At n=1 and n=2 many n-grams recur, so counts are meaningful. By n=4 almost every window is unique — there is no pattern left to count. In practice, bigrams and trigrams are the sweet spot: enough context to be useful, common enough to still carry statistical signal. Going higher usually buys sparsity, a ballooning feature space, and little else.

Two costs of large n

Sparsity: rare n-grams give unreliable counts. Explosion: the number of possible n-grams grows exponentially with n (a vocabulary of V words allows up to V^n distinct n-grams), so your feature space and memory blow up while most entries are zero. Both push you toward small n. Reach past trigrams only with a specific reason and a lot of data.
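The explosion is easy to see with a little arithmetic. The vocabulary size below is an assumed round number, and the helper name `possible_ngrams` is hypothetical:

```python
def possible_ngrams(vocabulary_size, n):
    """Upper bound on distinct n-grams: one vocabulary choice per position."""
    return vocabulary_size ** n


vocabulary_size = 10_000  # assumed vocabulary size, for illustration only

for n in range(1, 6):
    print(f"n={n}: up to {possible_ngrams(vocabulary_size, n):,} possible n-grams")
```

Only a small fraction of these combinations ever occurs in real text, which is why most entries in the resulting feature space stay at zero.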

Question (select one)

Why does adding the bigram ("not", "good") help a sentiment model that single words could not?

Bigrams are faster to compute than single words

A lone "not" and a lone "good" do not capture that they were adjacent; the bigram preserves the local order so the model can learn that "not good" is negative

Bigrams remove stopwords automatically

Bigrams translate the text into another language

Your turn: generate n-grams

Write a function get_ngrams(tokens, n) that returns a list of all n-grams of tokens, where each n-gram is a tuple of n adjacent tokens. Use ngrams from nltk.util (already imported) and remember it returns a generator, so wrap it in list(...).

For example:

get_ngrams(["a", "b", "c", "d"], 2) -> [("a","b"), ("b","c"), ("c","d")]
get_ngrams(["a", "b", "c", "d"], 3) -> [("a","b","c"), ("b","c","d")]

Check your understanding

Question (select one)

A trigram is:

A word that appears exactly three times

A contiguous sequence of three adjacent tokens

The three most common words in a text

A sentence with three words

Question (select one)

A token list has 12 tokens. How many bigrams does it produce?

Question (select one)

Why are bigrams and trigrams usually preferred over much larger n-grams (say, 6-grams)?

Larger n-grams are illegal in NLTK

As n grows, specific n-grams become increasingly rare (sparse) and the feature space explodes, so large n-grams carry little statistical signal while costing a lot of memory

Larger n-grams are always slower to type

Bigrams capture the entire meaning of any document

Question (select one)

nltk.util.ngrams(tokens, n) returns a generator. What must you do to get a reusable list of tuples you can index and count?

Nothing; a generator already behaves like a list

Wrap it in list(...), e.g. list(ngrams(tokens, n)), to materialize the n-grams into a list

Convert it to a string with str(...)

Call .sort() on the generator

You have now met every classic building block: tokens, normalized tokens, filtered tokens, roots, counts, tags, and n-grams. The final section puts them to work — starting with the most important skill of all: choosing which of these steps your task actually needs.
