Word Frequencies and Lexical Diversity

You now know how to turn a paragraph into a clean list of tokens. The simplest — and most useful — thing you can do with that list is count. How many words are there? How many distinct words? Which words appear most? How varied is the vocabulary? These basic statistics are the foundation of search ranking, keyword extraction, spam scoring, and a surprising amount of "text analytics".

Counting words with `FreqDist`

You could count words with a plain collections.Counter, and that is exactly what is happening under the hood. NLTK wraps it in a FreqDist ("frequency distribution") that adds text-specific conveniences.

A few FreqDist methods you will use constantly:

fd.most_common(n) — the n highest-frequency words as (word, count) pairs, already sorted. The everyday workhorse.
fd.N() — the total number of word tokens (counting repeats).
len(fd) — the number of distinct words (the vocabulary size).
fd["word"] — the count of a word; missing words return 0 rather than raising KeyError, because FreqDist is a Counter subclass (a plain dict would raise).
fd.freq("word") — the relative frequency, i.e. count divided by N().
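The methods above can be tried on a tiny token list. This is a minimal sketch that assumes NLTK is installed; the token list is made up for illustration and would normally come from your tokenizer.

```python
from nltk import FreqDist

# A tiny, made-up token list; in practice this comes from your tokenizer.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

fd = FreqDist(tokens)

print(fd.most_common(2))  # the two highest-frequency (word, count) pairs
print(fd.N())             # total number of tokens, counting repeats
print(len(fd))            # number of distinct words
print(fd["the"])          # count of a single word
print(fd["dog"])          # unseen word: 0, no KeyError
print(fd.freq("the"))     # relative frequency: count divided by N()
```

Here "the" occurs 3 times out of 8 tokens, so fd.freq("the") is 3 / 8 = 0.375.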

FreqDist is just a smarter Counter

FreqDist subclasses collections.Counter, so everything you know about Counter works, including most_common, and you can build one from any iterable of tokens. NLTK adds text-flavored helpers like N, freq, hapaxes, and plotting. If you ever do not have NLTK handy, Counter(tokens).most_common(n) gets you the same top-words list.
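If NLTK is not available, the same statistics can be approximated with the standard library alone. The sketch below uses only collections.Counter; the variable names are illustrative, and the comments note which FreqDist call each line stands in for.

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
counts = Counter(tokens)

total = sum(counts.values())           # same role as fd.N()
vocab_size = len(counts)               # same as len(fd)
top = counts.most_common(2)            # identical call on a plain Counter
rel_freq = counts["the"] / total       # same role as fd.freq("the")
# Words seen exactly once, in first-seen order (same role as fd.hapaxes()).
hapaxes = [word for word, count in counts.items() if count == 1]

print(total, vocab_size, top, rel_freq, hapaxes)
```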

Lexical diversity: how varied is the vocabulary?

Two texts can have the same length but very different richness. A text that says "very very very good" uses four words but only two distinct ones. The ratio of distinct words to total words is called lexical diversity, and it is a quick, classic measure of how repetitive or varied a text is.

$\text{lexical diversity} = \frac{\text{number of distinct words}}{\text{total number of words}}$

A value near 1 means almost every word is unique (highly varied); a value near 0 means heavy repetition.
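To make the ratio concrete, here is a small sketch that computes it for two made-up texts, one repetitive and one varied. Splitting on whitespace is a simplification; on real text you would use the tokens from your tokenizer.

```python
# Two made-up texts, split on whitespace for simplicity.
repetitive = "very very very good very very good".split()
varied = "the quick brown fox jumps over the lazy dog".split()

scores = {}
for name, tokens in [("repetitive", repetitive), ("varied", varied)]:
    distinct = len(set(tokens))   # number of distinct words
    total = len(tokens)           # total number of words
    scores[name] = distinct / total
    print(f"{name}: {distinct} distinct / {total} total = {scores[name]:.2f}")
```

The first text has 2 distinct words among 7 tokens (about 0.29); the second has 8 distinct words among 9 tokens (about 0.89), because only "the" repeats.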

The repetitive text scores low; the varied text scores near 1. Lexical diversity is used to compare writing styles, gauge reading level, flag machine-generated or spammy text (which often repeats), and track how an author's vocabulary changes across works.

Lexical diversity depends on length

Longer texts almost always have lower lexical diversity, because common words inevitably repeat as the text grows. This means you cannot fairly compare the raw diversity of a tweet to that of a novel. For honest comparisons, measure equal-length samples (or use length-corrected variants). The simple ratio is a great intuition-builder, not a length-proof metric.
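One simple way to compare texts of different lengths is to average the ratio over equal-sized chunks, an approach often called the mean segmental type-token ratio. The sketch below is one possible version under stated assumptions: the function name mean_chunk_diversity, the chunk size, and the sample passage are all made up for illustration, and size is assumed to be a positive integer.

```python
def mean_chunk_diversity(tokens, size):
    """Average distinct/total over consecutive full chunks of `size` tokens."""
    # Drop any trailing partial chunk so every chunk has the same length.
    chunks = [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]
    if not chunks:
        return 0.0
    return sum(len(set(chunk)) / size for chunk in chunks) / len(chunks)


passage = (
    "the cat sat on the mat and the dog sat on the rug while "
    "the bird sat on the fence and the fish swam in the bowl"
)
tokens = passage.split()

raw_short = len(set(tokens[:9])) / 9         # raw ratio, first 9 tokens only
raw_full = len(set(tokens)) / len(tokens)    # raw ratio, all 27 tokens
chunked = mean_chunk_diversity(tokens, 9)    # average over three 9-token chunks

print(f"{raw_short:.2f} {raw_full:.2f} {chunked:.2f}")
```

The raw ratio drops from 7/9 on the first nine tokens to 15/27 on the whole passage, purely because the passage is longer. The chunked average (20/27) stays close to the short-sample figure, since every chunk is measured at the same length.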

The long tail: language is lopsided

If you sort words by frequency, you find a striking, universal pattern: a tiny number of words are extraordinarily common, and a huge number appear just once or twice. In large English corpora, the single most frequent word ("the") often accounts for roughly 7% of all tokens by itself, while the "long tail" of rare words makes up most of the vocabulary. (This regularity is known as Zipf's law: a word's frequency is roughly inversely proportional to its rank in the frequency list.)

Words that appear exactly once have their own name — hapaxes — and FreqDist will list them for you. Let us visualize the lopsidedness with a bar chart of the top words.
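As a plain-text stand-in for a plotted chart, the sketch below prints one row of # characters per top word and then lists the words that occur exactly once. It uses collections.Counter so it runs without NLTK; with a FreqDist, fd.hapaxes() would give the same list of once-only words. The passage is a made-up toy example, so its proportions are far more extreme than in a real corpus.

```python
from collections import Counter

passage = (
    "the cat sat on the mat and the dog sat on the rug while "
    "the bird sat on the fence and the fish swam in the bowl"
)
counts = Counter(passage.split())

# One text "bar" per top word: the bar length equals the count.
top = counts.most_common(4)
for word, count in top:
    print(f"{word:<5} {'#' * count} {count}")

# Words that occur exactly once (the hapaxes).
hapaxes = [word for word, count in counts.items() if count == 1]
print(len(hapaxes), "hapaxes:", hapaxes)
```

Even in this toy passage, one word takes 8 of the 27 tokens while 11 of the 15 distinct words appear only once.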

The bar for "the" dwarfs the others, and a pile of words appear only once. This shape repeats in essentially every natural-language text.

A catch: raw frequency is dominated by stopwords

The long-tail shape leads directly to a practical problem. If you just ask "what are the most frequent words?", the answer is almost always the stopwords — "the", "a", "of" — which tell you nothing about the topic. The content words you actually care about are buried below them.

With stopwords, the top words are "the", "a", "that" — useless for guessing the topic. Without them, "text", "words", "search", "user", and "engine" rise to the top — and you can immediately tell this passage is about search. This is why keyword extraction almost always removes stopwords first, and it is a concrete payoff of the choice you studied two pages ago.
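The effect is easy to reproduce on a small scale. The sketch below counts a made-up sentence about search twice, once raw and once after dropping stopwords. The STOPWORDS set here is a tiny hand-picked illustration, not NLTK's list; in practice you would typically load a fuller list such as nltk.corpus.stopwords.words("english").

```python
from collections import Counter

# Tiny hand-picked stopword set, for illustration only.
STOPWORDS = {"the", "a", "that", "in", "and", "of"}

passage = (
    "the search engine ranks the pages that match the words in a query "
    "and the engine shows the user a list of the pages"
)
tokens = passage.split()

raw = Counter(tokens)
content = Counter(token for token in tokens if token not in STOPWORDS)

print("with stopwords:   ", raw.most_common(3))
print("without stopwords:", content.most_common(3))
```

In the raw counts "the" leads with 6 occurrences; after filtering, "engine" and "pages" (2 each) rise to the top, which already hints at the topic.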

Frequency is a starting point, not the finish line

"Most frequent equals most important" is a tempting but flawed heuristic. The most frequent words are usually the least informative (stopwords), and even among content words, a word common in this document but also common in every document is not very distinctive. The classic fix is to weight a word by how rare it is across documents — the idea behind TF-IDF. You do not need it yet, but file away that frequency alone over-rewards the commonplace.

Question (select one)

You build a FreqDist over the raw tokens of a news article and ask for most_common(5). The result is "the", "to", "of", "a", "and". Why is this unhelpful for figuring out the article's topic, and what is the usual fix?

The article has no topic; nothing can be done

The most frequent words are stopwords that appear in nearly all text; removing stopwords first surfaces the content words that actually indicate the topic

FreqDist counted wrong; use Counter instead

You must lemmatize before any counting will work at all

Your turn: basic text statistics

Implement two functions over a list of tokens:

lexical_diversity(tokens) — return the number of distinct tokens divided by the total number of tokens, as a float. For an empty list, return 0.0 (avoid dividing by zero).
top_words(tokens, n) — return the n most common tokens as a list of (word, count) pairs. Use FreqDist (already imported).

For example, with tokens = ["a", "b", "a", "c", "a", "b"], lexical_diversity(tokens) is 0.5 and top_words(tokens, 2) is [("a", 3), ("b", 2)].

Check your understanding

Question (select one)

What does fd.most_common(3) return for a FreqDist fd?

The three rarest words in the text

The three highest-frequency items as (word, count) pairs, sorted from most to least frequent

A random sample of three words

The first three words in the original text

Question (select one)

A text of 100 tokens contains 40 distinct words. What is its lexical diversity, and what does that number mean?

100 / 40 = 2.5; the text is highly varied

40 / 100 = 0.4; on average each distinct word is used about 2.5 times, indicating moderate repetition

0.4; the text is completely unique with no repeats

It cannot be computed without removing stopwords

Question (select one)

Language follows a long-tailed (Zipfian) frequency pattern. Which statement captures a practical consequence?

Every word in a text appears roughly the same number of times

A handful of words (mostly stopwords) account for a large share of all tokens, while a great many words appear only once or twice — so raw frequency is dominated by uninformative words

Rare words are always more frequent than common words

Frequency counts are impossible to compute for real text

Question (select one)

What is a hapax (as in fd.hapaxes())?

A word that appears in every document

A word that appears exactly once in the text

The most frequent word in the text

A punctuation token

You can now measure which words appear and how often. But counts treat every word as an isolated atom — they have no idea that "book" can be a noun or a verb. To capture that, we need to know each word's role in the sentence: part-of-speech tagging, next.
