From Text to Features: Bag-of-Words
Algorithms need numbers, not words. The bag-of-words model turns each document into a vector of word counts over a shared vocabulary. This page covers building a document-term matrix by hand, binary vs. count features, and the limitations (lost order, sparsity, no semantics) that motivate n-grams and beyond.
Everything so far has produced words — clean tokens, roots, tags, n-grams. But most algorithms that use text, from a spam filter to a clustering routine, do not consume words. They consume numbers. The bridge from a list of tokens to a row of numbers is called feature extraction, and the oldest, simplest, and still-everywhere technique for it is the bag-of-words model.
The name is the whole idea: take a document, throw all its words into a bag, and shake it. You keep which words appear and how many times, but you throw away the order. A document becomes a tally of word counts — and a tally is just a vector of numbers, which is exactly what an algorithm can work with.
The feature extraction process
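Bag-of-words turns a whole collection of documents into a table of numbers in a fixed sequence of steps: tokenize each document, build one shared vocabulary from the whole corpus, then count each vocabulary word in each document.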
The output is a document-term matrix: one row per document, one column per vocabulary word, and each cell holding how many times that word appeared in that document. Every row is a numeric vector that represents its document, and that is precisely the form a classifier wants.
Building it by hand
The model is simple enough to build from scratch with a Counter, and doing
so demystifies it completely. Watch a tiny three-document corpus become a
matrix of numbers.
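A minimal sketch of that construction, assuming a toy corpus: doc1 matches the row described just below, while doc2, doc3, and all variable names are illustrative choices rather than anything fixed by the model.

```python
from collections import Counter

# Toy corpus (illustrative): three short, already-lowercased documents.
corpus = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
    "doc3": "the cat chased the dog",
}

# Step 1: tokenize. Splitting on whitespace is enough for text this clean.
tokenized = {name: text.split() for name, text in corpus.items()}

# Step 2: build one shared vocabulary from the whole corpus, in sorted order.
vocab = sorted({word for tokens in tokenized.values() for word in tokens})

# Step 3: count every vocabulary word in every document.
matrix = {}
for name, tokens in tokenized.items():
    counts = Counter(tokens)  # a missing word counts as 0
    matrix[name] = [counts[word] for word in vocab]

# Print the document-term matrix: header row, then one row per document.
print(" " * 6 + " ".join(f"{word:>6}" for word in vocab))
for name, row in matrix.items():
    print(f"{name:<6}" + " ".join(f"{count:>6}" for count in row))
```

With this corpus the vocabulary is cat, chased, dog, log, mat, on, sat, the, and the rows are [1, 0, 0, 0, 1, 1, 1, 2] for doc1, [0, 0, 1, 1, 0, 1, 1, 2] for doc2, and [1, 1, 1, 0, 0, 0, 0, 2] for doc3.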
That table is the bag-of-words representation. Read doc1's row: "the" appears twice (a 2 in the "the" column), "cat", "sat", "on", and "mat" once each, and every other vocabulary word zero times. The sentence has become a row of integers — and crucially, all three rows have the same length (the size of the vocabulary), so they can be compared, added, and fed to an algorithm uniformly.
Why a shared vocabulary matters
Every document is counted against the same vocabulary, so every document vector has the same dimensions in the same order. That alignment is what lets an algorithm treat "count of the word 'cat'" as a consistent feature across all documents. Build the vocabulary once from your whole corpus, then vectorize each document against it.
Counts vs. presence (binary bag-of-words)
Sometimes you care only whether a word appears, not how often. For short texts or spam-style signals, presence is often enough, and it is less swayed by document length than raw counts are. That variant is binary bag-of-words: each cell is 1 or 0 instead of a count.
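A small sketch of the two variants side by side, assuming a made-up spam-like message in which "free" occurs three times (the message and the five-word vocabulary are hypothetical):

```python
from collections import Counter

# Hypothetical message and vocabulary, chosen only for illustration.
tokens = "free free free prize click now".split()
vocab = ["click", "free", "now", "prize", "meeting"]

counts = Counter(tokens)

# Count bag-of-words: how many times each vocabulary word occurs.
count_vector = [counts[word] for word in vocab]

# Binary bag-of-words: 1 if the word occurs at all, otherwise 0.
binary_vector = [1 if counts[word] > 0 else 0 for word in vocab]

print(count_vector)   # [1, 3, 1, 1, 0]
print(binary_vector)  # [1, 1, 1, 1, 0]
```

Both vectors have the same length and column order; only the cell values differ.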
The count vector records that "free" appeared three times (a strong spam signal); the binary vector just records that it appeared at all. Which is better depends on the task — another small pipeline choice, in the spirit of the previous page.
In practice you'd reach for a vectorizer
Building bag-of-words by hand, as we just did, is the right way to understand
it. In real projects you would typically use a ready-made vectorizer (such as
scikit-learn's CountVectorizer) that does tokenizing, vocabulary-building, and
counting in one step and returns an efficient sparse matrix. The mechanics are
identical to what you just coded — vocabulary, then per-document counts — so you
now know exactly what such a tool produces under the hood.
The limitations (and what they motivate)
Bag-of-words is powerful for its simplicity, but its simplifications are real and worth naming, because each one points to a technique you have met or will meet.
- It ignores word order. "dog bites man" and "man bites dog" produce the identical bag. This is the blind spot n-grams patch: adding bigram columns ("not good", "machine learning") puts a little order back in.
- It is sparse and high-dimensional. A real vocabulary has tens of thousands of words, so each document vector is mostly zeros. Storing all those zeros wastes memory, which is why real tools use sparse matrices, and why stopword removal and stemming/lemmatization (which shrink the vocabulary) help here.
- It has no notion of meaning. "great" and "excellent" are different columns with nothing connecting them; the model has no idea they are near-synonyms. Capturing meaning is what word embeddings (like Word2Vec) were invented for — conceptually, they place similar words near each other in a numeric space so "great" and "excellent" are close rather than unrelated. That is a topic for later study; bag-of-words remains the right first model to understand.
Bag-of-words throws away order — on purpose
The "bag" metaphor is a literal description: order is gone. For many tasks (topic detection, spam filtering) that loss is acceptable and the simplicity is worth it. For tasks where order is meaning (sentiment, where "not good" must not look like "good not"), compensate by adding n-gram features. Knowing what bag-of-words discards tells you exactly when you need to shore it up.
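The loss of order, and the way bigrams restore a little of it, can be checked directly. This is a minimal sketch; the helper names `unigram_bag` and `bigram_bag` are hypothetical, not part of any library.

```python
from collections import Counter

def unigram_bag(tokens):
    """Plain bag-of-words: a count for each single token."""
    return Counter(tokens)

def bigram_bag(tokens):
    """Bag of adjacent token pairs, which keeps some local order."""
    return Counter(zip(tokens, tokens[1:]))

a = "dog bites man".split()
b = "man bites dog".split()

# Same words, same counts: the unigram bags are identical.
same_unigrams = unigram_bag(a) == unigram_bag(b)
print(same_unigrams)  # True

# The adjacent pairs differ, so the bigram bags tell the sentences apart.
same_bigrams = bigram_bag(a) == bigram_bag(b)
print(same_bigrams)   # False
```

In a real feature matrix the bigrams would become extra columns alongside the single-word columns, at the cost of a larger and sparser vocabulary.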
In a bag-of-words representation, what does each column of the document-term matrix correspond to?
One document in the corpus
One word in the shared vocabulary, with each cell counting that word's occurrences in a document
One sentence in a document
One character in the text
Your turn: vectorize a document
Write a function bow_vector(tokens, vocab) that returns the bag-of-words
count vector for one document: a list with the same length as vocab,
where position i holds the number of times vocab[i] appears in tokens.
Words in tokens that are not in vocab are simply ignored.
For example, with vocab = ["cat", "dog", "fish"] and
tokens = ["dog", "cat", "dog", "bird"], the result is [1, 2, 0] — one
"cat", two "dog", zero "fish", and "bird" ignored.
Tip: collections.Counter makes this clean (it returns 0 for missing keys).
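To see the behavior the tip relies on, here is a short sketch of how a Counter responds to a key it has never seen (this shows only the lookup, not the exercise solution):

```python
from collections import Counter

counts = Counter(["dog", "cat", "dog", "bird"])

print(counts["dog"])      # 2
print(counts["fish"])     # 0, where a plain dict would raise KeyError
print("fish" in counts)   # False: looking up a missing key does not add it
```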
Check your understanding
Why do we convert text to a bag-of-words vector at all?
Because vectors take less disk space than text
Because most algorithms operate on numbers, not words, so each document must be turned into a fixed-length numeric vector before it can be fed to them
Because it makes the text easier for humans to read
Because it translates the document into another language
"dog bites man" and "man bites dog" produce the same bag-of-words vector. What does this reveal, and how is it commonly addressed?
It is a bug in the counting code
Bag-of-words discards word order, so different orderings of the same words collapse together; adding n-gram (e.g., bigram) features puts some local order back
It means the two sentences have different vocabularies
It proves bag-of-words cannot be used for any real task
How does binary bag-of-words differ from count bag-of-words?
Binary uses words; count uses numbers
Binary records only whether each word is present (1 or 0); count records how many times it appears
Binary is always more accurate than count
They produce vectors of different lengths
Bag-of-words gives "great" and "excellent" entirely separate columns with no connection between them. What limitation is this, and what technique is aimed at it?
A sparsity problem, fixed by stopword removal
It captures no meaning/similarity between words; word embeddings (e.g., Word2Vec) address it by placing similar words close together in a numeric space
A word-order problem, fixed by n-grams
There is no limitation here
You can now turn documents into numeric features. In the final page we put the entire course together — preprocessing choices, features, and a touch of n-gram thinking — to build a small but complete rule-based sentiment classifier from scratch, and watch every earlier lesson pay off.