Text Normalization: Case Folding and Punctuation

After tokenization you have a list of word-shaped strings. But to a computer, "The", "the", and "THE" are three completely different strings, and "dog" and "dog." are two. If you counted words right now, you would split the tally for "the" across three separate entries and miss matches constantly. Normalization is the step that reduces this surface-level variation so that tokens which should be treated as the same actually are.

The two workhorse normalization steps — and the focus of this page — are case folding (lowercasing) and punctuation stripping. Both are simple to do and surprisingly consequential to get wrong.

Case folding: collapsing "The", "the", and "THE"

Computers compare strings by their exact characters, so "The" == "the" is False. For most counting and matching tasks, that distinction is noise: you do not care that one "the" started a sentence and another did not. Case folding — usually just str.lower() — collapses all of them into one form.
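A minimal sketch of that collapse, using a made-up six-token list (the tokens are illustrative, not taken from a real corpus):

```python
from collections import Counter

# Hypothetical token list: two words, each in three casings.
tokens = ["The", "the", "THE", "Dog", "dog", "DOG"]

# Exact string comparison treats every casing as its own word.
raw_counts = Counter(tokens)
print(len(raw_counts))     # 6 distinct strings

# Lowercasing first lets all casings share one tally.
folded_counts = Counter(token.lower() for token in tokens)
print(len(folded_counts))  # 2 distinct strings
print(folded_counts["the"], folded_counts["dog"])  # 3 3
```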

Six "different" words became two. If your goal is to count how often each word appears, that collapse is exactly right — "The" at the start of a sentence and "the" in the middle are the same word, and you want them tallied together.

lower() vs casefold()

Python has two lowercasing methods. str.lower() is the everyday choice. str.casefold() is more aggressive: it applies Unicode case folding, which also handles characters that lower() leaves unchanged, most famously the German "ß", which casefold() turns into "ss". For plain ASCII English text the two behave identically; reach for casefold() when you are doing case-insensitive matching across many languages. Throughout this course we use lower().
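The difference is easiest to see on the "ß" case itself. The sketch below uses the German word "Straße" as an example input; the variable names are only for illustration:

```python
word = "Straße"

lowered = word.lower()      # leaves the ß in place
folded = word.casefold()    # rewrites ß as ss

print(lowered)  # straße
print(folded)   # strasse

# Case-insensitive matching against an all-caps spelling:
print("STRASSE".lower() == lowered)     # False
print("STRASSE".casefold() == folded)   # True

# On plain ASCII text the two methods agree.
print("Hello World".lower() == "Hello World".casefold())  # True
```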

When you should NOT lowercase

Here is where beginners get burned: lowercasing is lossy, and sometimes the case carries real information you need. Folding it away is a one-way door.

Named entities. "Apple" the company and "apple" the fruit are different things, and capitalization is the main clue. Lowercase first and you have thrown away the signal a name-recognition system relies on.
Acronyms. "US" (United States) becomes "us" (the pronoun). "WHO" (the health organization) becomes "who". After folding, nothing downstream can tell the acronym from the ordinary word.
Sentiment and emphasis. "this is FINE" shouted in all caps carries different emotional weight than "this is fine". For some sentiment tasks, ALL-CAPS is a feature worth keeping, not erasing.
Part-of-speech tagging. Taggers use capitalization as a hint that a word is a proper noun. Lowercasing before tagging removes that hint and can make the tagger worse — a reason to tag first, normalize later.
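The collisions listed above can be made visible with a short sketch. The token list is hypothetical, and the all-caps flag at the end is just one illustrative way to save a cue before folding, not a standard feature:

```python
from collections import defaultdict

# Hypothetical tokens where casing separates two meanings.
tokens = ["US", "us", "WHO", "who", "Apple", "apple"]

# Group the original spellings under their lowercased form.
merged = defaultdict(set)
for token in tokens:
    merged[token.lower()].add(token)

for folded, originals in merged.items():
    print(folded, sorted(originals))

# If emphasis matters, record it before folding the case away.
features = [(word.lower(), word.isupper()) for word in "this is FINE".split()]
print(features)  # [('this', False), ('is', False), ('fine', True)]
```

Each lowercased key ends up holding two originally distinct tokens, which is exactly the information a later step can no longer recover.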

Normalization is information destruction (on purpose)

Every normalization step deletes information so that the remaining tokens compare more easily. That is its entire job — but it means the step can never be undone, and if the information you deleted mattered to your task, you have quietly made things worse. Always ask: what am I throwing away, and does my task need it? For plain word-counting, casing is noise. For entity recognition, casing is signal.

Punctuation stripping

The second normalization workhorse is removing punctuation. After tokenization you often have standalone punctuation tokens (",", "!", ".") and sometimes punctuation still clinging to words. For most bag-of-words style tasks, punctuation is noise you want gone. There are a few common ways to strip it; know more than one.
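Three common approaches are an isalpha filter, str.translate, and a letters-only regex. The sketch below runs all three on the same input; the sample sentence is an assumption chosen to include contractions, and the tokens come from a plain whitespace split rather than a real tokenizer:

```python
import re
import string

# Hypothetical sample sentence with trailing punctuation and contractions.
text = "Hello World! It's a test, isn't it"
tokens = text.lower().split()

# Approach 1: keep only tokens made entirely of letters.
alpha_only = [tok for tok in tokens if tok.isalpha()]

# Approach 2: delete every character listed in string.punctuation.
table = str.maketrans("", "", string.punctuation)
translated = text.lower().translate(table).split()

# Approach 3: pull out runs of lowercase ASCII letters.
letter_runs = re.findall(r"[a-z]+", text.lower())

print(alpha_only)   # ['hello', 'a', 'it']
print(translated)   # ['hello', 'world', 'its', 'a', 'test', 'isnt', 'it']
print(letter_runs)  # ['hello', 'world', 'it', 's', 'a', 'test', 'isn', 't', 'it']
```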

Compare the three results carefully — they disagree, and the disagreements are instructive:

Approach 1 (isalpha filter) drops "World!" entirely because the whole token "World!" is not alphabetic. That is probably not what you wanted — you lost the word, not just the punctuation.
Approach 2 (translate) removes punctuation characters, turning "World!" into "world" and "It's" into "its". It keeps the words but glues contractions together ("its").
Approach 3 (regex [a-z]+) extracts runs of letters, so "isn't" becomes two tokens, "isn" and "t".

There is no single right answer; each approach makes a different trade-off about contractions and partial-punctuation tokens. The lesson is to look at what your stripping actually does rather than trusting it blindly.

The order trap, again

Strip punctuation after tokenizing, not before. If you delete all the periods first, you destroy the very clues a sentence tokenizer needs to find sentence boundaries — and you may even merge two sentences into one run-on. Tokenize, then clean.
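To see the trap in action, here is a sketch with a deliberately naive sentence splitter. The split_sentences helper is hypothetical and far cruder than a real sentence tokenizer; it only breaks on whitespace that follows ".", "!" or "?":

```python
import re
import string

text = "I saw a dog. It barked."

def split_sentences(s):
    # Naive rule for illustration: split on whitespace preceded by . ! or ?
    return re.split(r"(?<=[.!?])\s+", s)

# Tokenize first: both sentence boundaries survive.
good = split_sentences(text)
print(good)  # ['I saw a dog.', 'It barked.']

# Strip punctuation first: the boundary clue is gone.
stripped = text.translate(str.maketrans("", "", string.punctuation))
bad = split_sentences(stripped)
print(bad)   # ['I saw a dog It barked']
```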

Question (select one)

Why is lowercasing described as a lossy operation?

Because it makes the text take up less memory

Because it permanently discards capitalization, which sometimes carries real meaning (e.g., "Apple" vs "apple", "US" vs "us"), and that information cannot be recovered afterward

Because it always introduces spelling errors

Because it converts the text into a different language

What normalization gives and takes

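It is worth holding both effects in your head at once. The gain is fewer distinct spellings, so counts and matches line up. The loss is every cue that casing and punctuation carried: proper-noun capitalization, acronyms, all-caps emphasis, and sentence boundaries.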

For a search index or a word-frequency study, the gains dominate and the losses are irrelevant — lowercase away. For named-entity recognition or emphasis-sensitive sentiment, the losses can be fatal — be careful, or skip the step. Same operation, opposite verdict, depending on the task.

Your turn: write a normalizer

Write a function normalize(text) that returns a list of normalized words:

Lowercase the whole string.
Remove punctuation characters (the ones in string.punctuation).
Split on whitespace and drop any empty strings.

For example, normalize("Hello, World!") should return ['hello', 'world']. The string module is already imported for you.

(Heads up: because step 2 removes the apostrophe, a contraction like "it's" will become "its". That is a real side effect of character-level punctuation stripping — you do not need to fix it here, just be aware of it.)
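If you want something to compare your attempt against, here is one possible solution. It is a sketch that follows the three steps literally, not the only valid answer:

```python
import string

def normalize(text):
    # Step 1: lowercase the whole string.
    lowered = text.lower()
    # Step 2: delete every character found in string.punctuation.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Step 3: split() with no arguments splits on any whitespace
    # and never produces empty strings.
    return no_punct.split()

print(normalize("Hello, World!"))  # ['hello', 'world']
print(normalize("it's"))           # ['its']
```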

Check your understanding

Question (select one)

For a system that counts how often each word appears in a news article, why is lowercasing usually the right call?

It makes the article shorter

Different capitalizations of the same word ("The", "the", "THE") are the same word for counting purposes, so folding them together gives accurate totals

It translates the article into another language

It is required before any text can be stored

Question (select one)

A team building a system to detect company names in news ("Apple announced…") lowercases all text as the very first step and then complains the system confuses companies with common nouns. What went wrong?

Lowercasing is too slow for news text

Capitalization is a key signal for recognizing names, and lowercasing first destroyed it — for this task, case folding is harmful

The system needed more punctuation, not less

News articles cannot be processed by computers

Question (select one)

You strip punctuation by keeping only tokens where token.isalpha() is True. On the token "World!" what happens, and why might that surprise you?

"World!" becomes "World" with the "!" removed

The entire token "World!" is dropped, because the whole string is not alphabetic — so you lose the word, not just the punctuation

It raises an error

It splits into "World" and "!"

Question (select one)

Why should punctuation stripping generally come after sentence tokenization, not before?

Punctuation cannot be removed from a tokenized list

Sentence tokenizers use periods, question marks, and exclamation points to find sentence boundaries; removing them first destroys those clues and can merge separate sentences

Stripping punctuation changes the language of the text

It does not matter; order is irrelevant here

Case folding and punctuation stripping shrink the number of distinct spellings. But two related words like "running" and "runs" still look unrelated, and ultra-common words like "the" still dominate the counts. We tackle the common words next: stopwords.
