Natural Language Processing with Python

From Text to Features: Bag-of-Words

Algorithms need numbers, not words. The bag-of-words model turns each document into a vector of word counts over a shared vocabulary. This page covers building a document-term matrix by hand, binary versus count features, and the limitations (lost order, sparsity, no semantics) that motivate n-grams and beyond.

Everything so far has produced words — clean tokens, roots, tags, n-grams. But most algorithms that use text, from a spam filter to a clustering routine, do not consume words. They consume numbers. The bridge from a list of tokens to a row of numbers is called feature extraction, and the oldest, simplest, and still-everywhere technique for it is the bag-of-words model.

The name is the whole idea: take a document, throw all its words into a bag, and shake it. You keep which words appear and how many times, but you throw away the order. A document becomes a tally of word counts — and a tally is just a vector of numbers, which is exactly what an algorithm can work with.

The feature extraction process

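Bag-of-words turns a whole collection of documents into a table of numbers in a fixed sequence of steps: tokenize each document, build one shared vocabulary from all the tokens, then count each document's tokens against that vocabulary.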

The output is a document-term matrix: one row per document, one column per vocabulary word, and each cell holding how many times that word appeared in that document. Every row is a numeric vector that represents its document, and that is precisely the form a classifier wants.

Building it by hand

The model is simple enough to build from scratch with a Counter, and doing so demystifies it completely. Watch a tiny three-document corpus become a matrix of numbers.
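
The by-hand build might look like the minimal sketch below. The first document is the sentence discussed in the text; the second and third documents, the variable names, the whitespace tokenizer, and the alphabetical column order are assumptions made for illustration.

```python
from collections import Counter

# Assumed toy corpus: doc1 is the sentence discussed in the text;
# doc2 and doc3 are made up for illustration.
corpus = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
    "doc3": "the cat chased the dog",
}

# Step 1: tokenize (a plain whitespace split is enough for this clean text).
tokenized = {name: text.split() for name, text in corpus.items()}

# Step 2: build one shared vocabulary from every document, sorted so the
# column order is fixed and reproducible.
vocab = sorted({word for tokens in tokenized.values() for word in tokens})

# Step 3: count each document against that vocabulary.
matrix = {}
for name, tokens in tokenized.items():
    counts = Counter(tokens)
    # Counter returns 0 for words the document does not contain.
    matrix[name] = [counts[word] for word in vocab]

# Print the document-term matrix as a small table.
width = max(len(word) for word in vocab) + 1
print(" " * 5 + "".join(word.rjust(width) for word in vocab))
for name, row in matrix.items():
    print(name.ljust(5) + "".join(str(n).rjust(width) for n in row))
```

With this assumed corpus the vocabulary has eight words, so every row has eight cells, whatever the length of the original sentence.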

That table is the bag-of-words representation. Read doc1's row: "the" appears twice (a 2 in the "the" column), "cat", "sat", "on", and "mat" once each, and every other vocabulary word zero times. The sentence has become a row of integers — and crucially, all three rows have the same length (the size of the vocabulary), so they can be compared, added, and fed to an algorithm uniformly.

Why a shared vocabulary matters

Every document is counted against the same vocabulary, so every document vector has the same dimensions in the same order. That alignment is what lets an algorithm treat "count of the word 'cat'" as a consistent feature across all documents. Build the vocabulary once from your whole corpus, then vectorize each document against it.
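
To see why the alignment matters, the sketch below vectorizes two documents first against their own vocabularies and then against a shared one. The documents and variable names are hypothetical, chosen only to make the mismatch visible.

```python
# Two hypothetical documents, tokenized by whitespace.
doc_a = "the cat sat".split()
doc_b = "the dog sat down".split()

# Per-document vocabularies: the vectors differ in length, and column 0
# means "cat" in one vector but "dog" in the other.
vocab_a = sorted(set(doc_a))                    # ['cat', 'sat', 'the']
vocab_b = sorted(set(doc_b))                    # ['dog', 'down', 'sat', 'the']
own_a = [doc_a.count(w) for w in vocab_a]
own_b = [doc_b.count(w) for w in vocab_b]

# Shared vocabulary, built once from both documents: same length, same
# column order, so the vectors are directly comparable.
shared = sorted(set(doc_a) | set(doc_b))        # ['cat', 'dog', 'down', 'sat', 'the']
vec_a = [doc_a.count(w) for w in shared]
vec_b = [doc_b.count(w) for w in shared]

# A later document is counted against the same fixed vocabulary;
# "bird" is not in it, so it is simply ignored.
new_doc = "the bird sat".split()
vec_new = [new_doc.count(w) for w in shared]

print(own_a, own_b)           # [1, 1, 1] [1, 1, 1, 1]
print(vec_a, vec_b, vec_new)  # [1, 0, 0, 1, 1] [0, 1, 1, 1, 1] [0, 0, 0, 1, 1]
```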

Counts vs. presence (binary bag-of-words)

Sometimes you care only whether a word appears, not how often — for short texts or spam-style signals, presence is enough and is less swayed by length. That variant is binary bag-of-words: each cell is 1 or 0 instead of a count.
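
A minimal sketch of the two variants side by side. The message and the vocabulary are assumptions made up for illustration; only the idea of a word repeated three times is taken from the discussion that follows.

```python
from collections import Counter

# Assumed spam-like message and assumed shared vocabulary.
tokens = "free free free prize click now".split()
vocab = ["click", "free", "hello", "now", "prize"]

counts = Counter(tokens)

# Count bag-of-words: how many times each vocabulary word appears.
count_vector = [counts[w] for w in vocab]

# Binary bag-of-words: 1 if the word appears at all, otherwise 0.
binary_vector = [1 if counts[w] > 0 else 0 for w in vocab]

print(count_vector)   # [1, 3, 0, 1, 1]
print(binary_vector)  # [1, 1, 0, 1, 1]
```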

The count vector records that "free" appeared three times (a strong spam signal); the binary vector just records that it appeared at all. Which is better depends on the task — another small pipeline choice, in the spirit of the previous page.

In practice you'd reach for a vectorizer

Building bag-of-words by hand, as we just did, is the right way to understand it. In real projects you would typically use a ready-made vectorizer (such as scikit-learn's CountVectorizer) that does tokenizing, vocabulary-building, and counting in one step and returns an efficient sparse matrix. The mechanics are identical to what you just coded — vocabulary, then per-document counts — so you now know exactly what such a tool produces under the hood.

The limitations (and what they motivate)

Bag-of-words is powerful for its simplicity, but its simplifications are real and worth naming, because each one points to a technique you have met or will meet.

It ignores word order. "dog bites man" and "man bites dog" produce the identical bag. This is the blind spot n-grams patch: adding bigram columns ("not good", "machine learning") puts a little order back in.
It is sparse and high-dimensional. A real vocabulary can run to tens of thousands of words, while any single document uses only a small fraction of them, so each document vector is mostly zeros. Storing all those zeros wastes memory, which is why real tools use sparse matrices, and why stopword removal and stemming/lemmatization, which shrink the vocabulary, help here.
It has no notion of meaning. "great" and "excellent" are different columns with nothing connecting them; the model has no idea they are near-synonyms. Capturing meaning is what word embeddings (like Word2Vec) were invented for — conceptually, they place similar words near each other in a numeric space so "great" and "excellent" are close rather than unrelated. That is a topic for later study; bag-of-words remains the right first model to understand.
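
The sparsity point in the list above can be made concrete with a small sketch. The 10,000 placeholder words stand in for a realistic vocabulary and are purely hypothetical, as is the dictionary used as a stand-in for a sparse row; real libraries use dedicated sparse matrix types.

```python
from collections import Counter

# Hypothetical large vocabulary: 10,000 placeholder words plus five real ones.
vocab = [f"word{i}" for i in range(10_000)] + ["cat", "mat", "on", "sat", "the"]
tokens = "the cat sat on the mat".split()

counts = Counter(tokens)

# Dense vector: one slot per vocabulary word, almost all of them zero.
dense = [counts[w] for w in vocab]
nonzero = sum(1 for n in dense if n != 0)

# Sparse alternative: keep only {column index: count} for the nonzero cells.
index = {w: i for i, w in enumerate(vocab)}
sparse = {index[w]: n for w, n in counts.items() if w in index}

print(len(dense), nonzero, len(sparse))  # 10005 5 5
```

The dense row stores 10,005 numbers to convey five facts; the sparse form stores just those five.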

Bag-of-words throws away order — on purpose

The "bag" metaphor is a literal description: order is gone. For many tasks (topic detection, spam filtering) that loss is acceptable and the simplicity is worth it. For tasks where order is meaning (sentiment, where "not good" must not look like "good not"), compensate by adding n-gram features. Knowing what bag-of-words discards tells you exactly when you need to shore it up.
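
A short sketch of how bigram features put some order back. The helper name bigrams is an assumption introduced for this example, not a function from any library.

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent word pairs, each joined by a space (illustrative helper)."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

a = "dog bites man".split()
b = "man bites dog".split()

# Unigram bags are identical: same words, same counts.
same_unigrams = Counter(a) == Counter(b)

# Bigram bags differ, because each pair records which word came first.
same_bigrams = Counter(bigrams(a)) == Counter(bigrams(b))

print(same_unigrams)  # True
print(bigrams(a))     # ['dog bites', 'bites man']
print(bigrams(b))     # ['man bites', 'bites dog']
print(same_bigrams)   # False
```

In a real feature matrix the bigrams would be added as extra columns next to the single-word columns, at the cost of a larger vocabulary.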

Question (select one)

In a bag-of-words representation, what does each column of the document-term matrix correspond to?

One document in the corpus

One word in the shared vocabulary, with each cell counting that word's occurrences in a document

One sentence in a document

One character in the text

Your turn: vectorize a document

Challenge

Turn a document into a bag-of-words vector

Write a function bow_vector(tokens, vocab) that returns the bag-of-words count vector for one document: a list with the same length as vocab, where position i holds the number of times vocab[i] appears in tokens. Words in tokens that are not in vocab are simply ignored.

For example, with vocab = ["cat", "dog", "fish"] and tokens = ["dog", "cat", "dog", "bird"], the result is [1, 2, 0] — one "cat", two "dog", zero "fish", and "bird" ignored.

Tip: collections.Counter makes this clean (it returns 0 for missing keys).
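
The Counter behavior the tip relies on, shown in isolation (the token list is just an example):

```python
from collections import Counter

counts = Counter(["dog", "cat", "dog"])

print(counts["dog"])     # 2
print(counts["fish"])    # 0, and no KeyError is raised
print("fish" in counts)  # False: looking up a missing key does not add it
```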

Check your understanding

Question (select one)

Why do we convert text to a bag-of-words vector at all?

Because vectors take less disk space than text

Because most algorithms operate on numbers, not words, so each document must be turned into a fixed-length numeric vector before it can be fed to them

Because it makes the text easier for humans to read

Because it translates the document into another language

Question (select one)

"dog bites man" and "man bites dog" produce the same bag-of-words vector. What does this reveal, and how is it commonly addressed?

It is a bug in the counting code

Bag-of-words discards word order, so different orderings of the same words collapse together; adding n-gram (e.g., bigram) features puts some local order back

It means the two sentences have different vocabularies

It proves bag-of-words cannot be used for any real task

Question (select one)

How does binary bag-of-words differ from count bag-of-words?

Binary uses words; count uses numbers

Binary records only whether each word is present (1 or 0); count records how many times it appears

Binary is always more accurate than count

They produce vectors of different lengths

Question (select one)

Bag-of-words gives "great" and "excellent" entirely separate columns with no connection between them. What limitation is this, and what technique is aimed at it?

A sparsity problem, fixed by stopword removal

It captures no meaning/similarity between words; word embeddings (e.g., Word2Vec) address it by placing similar words close together in a numeric space

A word-order problem, fixed by n-grams

There is no limitation here

You can now turn documents into numeric features. On the next page we put the entire course together (preprocessing choices, features, and a touch of n-gram thinking) to build a small but complete rule-based sentiment classifier from scratch, and watch every earlier lesson pay off.
