Word Frequencies and Lexical Diversity
Once text has been reduced to clean tokens, the first thing to measure is how often each word appears. This lesson covers NLTK's FreqDist, .most_common(), lexical diversity, hapaxes, and the long-tailed shape of language, plus why raw frequency alone is dominated by stopwords.
You now know how to turn a paragraph into a clean list of tokens. The simplest — and most useful — thing you can do with that list is count. How many words are there? How many distinct words? Which words appear most? How varied is the vocabulary? These basic statistics are the foundation of search ranking, keyword extraction, spam scoring, and a surprising amount of "text analytics".
Counting words with FreqDist
You could count words with a plain collections.Counter, and that is exactly
what is happening under the hood. NLTK wraps it in a FreqDist ("frequency
distribution") that adds text-specific conveniences.
A few FreqDist methods you will use constantly:
- fd.most_common(n): the n highest-frequency words as (word, count) pairs, already sorted. The everyday workhorse.
- fd.N(): the total number of word tokens (counting repeats).
- len(fd): the number of distinct words (the vocabulary size).
- fd["word"]: the count of a word; missing words return 0 rather than raising, because FreqDist is a Counter subclass (a plain dict would raise KeyError).
- fd.freq("word"): the relative frequency, i.e. count divided by N().
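As a minimal sketch of these methods in action, the snippet below builds a FreqDist from a made-up ten-token sentence. The sentence and the variable names are illustrative assumptions, not part of the lesson's data.

```python
from nltk.probability import FreqDist

# Hypothetical sample sentence, split on whitespace for simplicity.
tokens = "the cat sat on the mat and the cat slept".split()
fd = FreqDist(tokens)

top_two = fd.most_common(2)   # highest-frequency (word, count) pairs
total_tokens = fd.N()         # all tokens, counting repeats
vocab_size = len(fd)          # distinct words
dog_count = fd["dog"]         # unseen word: 0, no KeyError
the_share = fd.freq("the")    # count of "the" divided by N()

print(top_two, total_tokens, vocab_size, dog_count, the_share)
# [('the', 3), ('cat', 2)] 10 7 0 0.3
```

Note that looking up an unseen word such as "dog" does not add it to the distribution, so the vocabulary size is unchanged by the lookup.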
FreqDist is just a smarter Counter
FreqDist subclasses collections.Counter, so everything you know about
Counter works (including most_common), and you can build one from any
iterable of tokens. NLTK adds text-flavored helpers like N, freq, hapaxes,
and plotting. If you ever do not have NLTK handy,
Counter(tokens).most_common(n) gets you the same top-words list.
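For the NLTK-free route, a sketch using only the standard library might look like this. The sample tokens are made up, and the two derived values simply reproduce by hand what fd.N() and fd.freq() would report.

```python
from collections import Counter

# Hypothetical sample tokens.
tokens = "to be or not to be and to do".split()
counts = Counter(tokens)

top_two = counts.most_common(2)          # the same call FreqDist inherits
total_tokens = sum(counts.values())      # what fd.N() would report
to_share = counts["to"] / total_tokens   # what fd.freq("to") would report

print(top_two)        # [('to', 3), ('be', 2)]
print(total_tokens)   # 9
```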
Lexical diversity: how varied is the vocabulary?
Two texts can have the same length but very different richness. A text that says "very very very good" uses four words but only two distinct ones. The ratio of distinct words to total words is called lexical diversity, and it is a quick, classic measure of how repetitive or varied a text is.
A value near 1 means almost every word is unique (highly varied); a value near 0 means heavy repetition.
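To make the ratio concrete, here is a small sketch that scores one repetitive and one varied token list. Both sample texts are invented for illustration.

```python
# Two hypothetical sample texts, split on whitespace.
repetitive = "spam spam spam spam spam spam eggs spam".split()
varied = "the quick brown fox jumps over the lazy dog".split()

# Lexical diversity = distinct tokens / total tokens.
repetitive_score = len(set(repetitive)) / len(repetitive)   # 2 / 8
varied_score = len(set(varied)) / len(varied)               # 8 / 9

print(round(repetitive_score, 2))  # 0.25
print(round(varied_score, 2))      # 0.89
```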
The repetitive text scores low; the varied text scores near 1. Lexical diversity is used to compare writing styles, gauge reading level, flag machine-generated or spammy text (which often repeats), and track how an author's vocabulary changes across works.
Lexical diversity depends on length
Longer texts almost always have lower lexical diversity, because common words inevitably repeat as the text grows. This means you cannot fairly compare the raw diversity of a tweet to that of a novel. For honest comparisons, measure equal-length samples (or use length-corrected variants). The simple ratio is a great intuition-builder, not a length-proof metric.
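The length effect can be sketched with a deliberately artificial example: the "long" text below is just the short one repeated ten times, so the vocabulary is identical and only the length changes. Real long texts do add new words, but more slowly than they add tokens, so the same downward pull applies.

```python
short = "the quick brown fox jumps over the lazy dog".split()
# Hypothetical "long" text: same vocabulary, ten times the length.
long_text = short * 10

short_score = len(set(short)) / len(short)           # 8 / 9
long_score = len(set(long_text)) / len(long_text)    # 8 / 90

# Fairer comparison: score an equal-length sample of the long text.
window = len(short)
long_sample = long_text[:window]
long_sample_score = len(set(long_sample)) / window

print(round(short_score, 2), round(long_score, 2), round(long_sample_score, 2))
```

Comparing the first `window` tokens of each text is only one simple way to equalize length; averaging over several equal-sized windows is a common refinement.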
The long tail: language is lopsided
If you sort words by frequency, you find a striking, near-universal pattern: a tiny number of words are extraordinarily common, and a huge number appear just once or twice. In English, the single most frequent word ("the") often accounts for roughly 7% of all tokens by itself, while the "long tail" of rare words makes up most of the vocabulary. (This regularity is known as Zipf's law: a word's frequency is roughly inversely proportional to its frequency rank.)
Words that appear exactly once have their own name — hapaxes — and
FreqDist will list them for you. Let us visualize the lopsidedness with a
bar chart of the top words.
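A rough sketch of that picture, using a made-up twenty-token passage and a text-only stand-in for the bar chart (one '#' per occurrence) so it runs without a plotting library:

```python
from nltk.probability import FreqDist

# Hypothetical passage, split on whitespace.
tokens = ("the cat sat on the mat and the dog sat by the door "
          "while the rain fell on the roof").split()
fd = FreqDist(tokens)

hapaxes = fd.hapaxes()          # words that occur exactly once
top_words = fd.most_common(5)   # (word, count) pairs, most frequent first

# Text stand-in for a bar chart: one '#' per occurrence.
for word, count in top_words:
    print(f"{word:<6}{'#' * count} {count}")

print(len(hapaxes), "of", len(fd), "distinct words are hapaxes")
# 10 of 13 distinct words are hapaxes
```

FreqDist also has a plot method for a graphical version; it relies on matplotlib being installed.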
The bar for "the" dwarfs the others, and a pile of words appear only once. This shape repeats in essentially every natural-language text.
A catch: raw frequency is dominated by stopwords
The long-tail shape leads directly to a practical problem. If you just ask "what are the most frequent words?", the answer is almost always the stopwords — "the", "a", "of" — which tell you nothing about the topic. The content words you actually care about are buried below them.
With stopwords, the top words are "the", "a", "that" — useless for guessing the topic. Without them, "text", "words", "search", "user", and "engine" rise to the top — and you can immediately tell this passage is about search. This is why keyword extraction almost always removes stopwords first, and it is a concrete payoff of the choice you studied two pages ago.
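The same before-and-after comparison can be reproduced on any passage. The sketch below uses an invented sentence and a tiny hand-written stopword set (both are assumptions for illustration; a real pipeline would use a fuller list such as NLTK's). It counts with Counter, which behaves the same as FreqDist for this purpose.

```python
from collections import Counter

# Hypothetical passage and a tiny hand-written stopword set (assumptions).
passage = ("the search engine reads the text that the user types and the "
           "engine matches the text of the query to the text of the pages")
stopwords = {"the", "that", "and", "of", "to", "a"}

tokens = passage.split()
content_tokens = [t for t in tokens if t not in stopwords]

top_raw = Counter(tokens).most_common(1)              # dominated by a stopword
top_content = Counter(content_tokens).most_common(2)  # topic words surface

print(top_raw)      # [('the', 8)]
print(top_content)  # [('text', 3), ('engine', 2)]
```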
Frequency is a starting point, not the finish line
"Most frequent equals most important" is a tempting but flawed heuristic. The most frequent words are usually the least informative (stopwords), and even among content words, a word common in this document but also common in every document is not very distinctive. The classic fix is to weight a word by how rare it is across documents — the idea behind TF-IDF. You do not need it yet, but file away that frequency alone over-rewards the commonplace.
You build a FreqDist over the raw tokens of a news article and ask for
most_common(5). The result is "the", "to", "of", "a", "and". Why is this
unhelpful for figuring out the article's topic, and what is the usual fix?
The article has no topic; nothing can be done
The most frequent words are stopwords that appear in nearly all text; removing stopwords first surfaces the content words that actually indicate the topic
FreqDist counted wrong; use Counter instead
You must lemmatize before any counting will work at all
Your turn: basic text statistics
Implement two functions over a list of tokens:
- lexical_diversity(tokens): return the number of distinct tokens divided by the total number of tokens, as a float. For an empty list, return 0.0 (avoid dividing by zero).
- top_words(tokens, n): return the n most common tokens as a list of (word, count) pairs. Use FreqDist (already imported).
For example, with tokens = ["a", "b", "a", "c", "a", "b"],
lexical_diversity(tokens) is 0.5 and top_words(tokens, 2) is
[("a", 3), ("b", 2)].
Check your understanding
What does fd.most_common(3) return for a FreqDist fd?
The three rarest words in the text
The three highest-frequency items as (word, count) pairs, sorted from most to least frequent
A random sample of three words
The first three words in the original text
A text of 100 tokens contains 40 distinct words. What is its lexical diversity, and what does that number mean?
100 / 40 = 2.5; the text is highly varied
40 / 100 = 0.4; on average each distinct word is used about 2.5 times, indicating moderate repetition
0.4; the text is completely unique with no repeats
It cannot be computed without removing stopwords
Language follows a long-tailed (Zipfian) frequency pattern. Which statement captures a practical consequence?
Every word in a text appears roughly the same number of times
A handful of words (mostly stopwords) account for a large share of all tokens, while a great many words appear only once or twice — so raw frequency is dominated by uninformative words
Rare words are always more frequent than common words
Frequency counts are impossible to compute for real text
What is a hapax (as in fd.hapaxes())?
A word that appears in every document
A word that appears exactly once in the text
The most frequent word in the text
A punctuation token
You can now measure which words appear and how often. But counts treat every word as an isolated atom — they have no idea that "book" can be a noun or a verb. To capture that, we need to know each word's role in the sentence: part-of-speech tagging, next.