Text Normalization: Case Folding and Punctuation
Why we lowercase text and strip punctuation — to make tokens that should be equal actually compare equal — and what that normalization quietly throws away. When case folding helps, when it destroys meaning (acronyms, names, shouting), and how to strip punctuation safely.
After tokenization you have a list of word-shaped strings. But to a computer,
"The", "the", and "THE" are three completely different strings, and
"dog" and "dog." are two. If you counted words right now, you would
split the tally for "the" across three separate entries and miss matches
constantly. Normalization is the step
that reduces this surface-level variation so that tokens which should be
treated as the same actually are.
The two workhorse normalization steps — and the focus of this page — are case folding (lowercasing) and punctuation stripping. Both are simple to do and surprisingly consequential to get wrong.
Case folding: collapsing "The", "the", and "THE"
Computers compare strings by their exact characters, so "The" == "the" is
False. For most counting and matching tasks, that distinction is noise: you
do not care that one "the" started a sentence and another did not. Case
folding — usually just str.lower() — collapses all of them into one form.
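As a small illustration (the token list below is made up for this example, not taken from a real corpus), here is what lowercasing does to the number of distinct strings:

```python
# Hypothetical token list: two words, each written in three casings.
tokens = ["The", "the", "THE", "Dog", "dog", "DOG"]

# Distinct strings as the computer sees them, before and after folding.
distinct_before = set(tokens)
distinct_after = {t.lower() for t in tokens}

print(len(distinct_before), sorted(distinct_before))
print(len(distinct_after), sorted(distinct_after))
```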
Six "different" words became two. If your goal is to count how often each word appears, that collapse is exactly right — "The" at the start of a sentence and "the" in the middle are the same word, and you want them tallied together.
lower() vs casefold()
Python has two lowercasing methods. str.lower() is the everyday choice.
str.casefold() is more aggressive and handles some non-English cases — most
famously the German "ß", which casefold() turns into "ss". For English text
they behave identically; reach for casefold() when you are doing
case-insensitive matching across many languages. Throughout this course we
use lower().
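A quick sketch of the difference, using the German word "Straße" as the example input (the variable names are only for illustration):

```python
word = "Straße"

lowered = word.lower()      # 'straße'  (the ß is left alone)
folded = word.casefold()    # 'strasse' (the ß is expanded to 'ss')

# Case-insensitive comparison against the all-caps spelling:
matches_with_lower = "STRASSE".lower() == word.lower()           # False
matches_with_fold = "STRASSE".casefold() == word.casefold()      # True

# On plain English text the two methods give the same result.
english = "The Quick Brown Fox"
english_same = english.lower() == english.casefold()             # True

print(lowered, folded, matches_with_lower, matches_with_fold, english_same)
```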
When you should NOT lowercase
Here is where beginners get burned: lowercasing is lossy, and sometimes the case carries real information you need. Folding it away is a one-way door.
- Named entities. "Apple" the company and "apple" the fruit are different things, and capitalization is the main clue. Lowercase first and you have thrown away the signal a name-recognition system relies on.
- Acronyms. "US" (United States) becomes "us" (the pronoun). "WHO" (the health organization) becomes "who". These collisions can be serious.
- Sentiment and emphasis. "this is FINE" shouted in all caps carries different emotional weight than "this is fine". For some sentiment tasks, ALL-CAPS is a feature worth keeping, not erasing.
- Part-of-speech tagging. Taggers use capitalization as a hint that a word is a proper noun. Lowercasing before tagging removes that hint and can make the tagger worse — a reason to tag first, normalize later.
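The collisions above are easy to reproduce. The sketch below also shows one possible workaround, a hypothetical helper named fold_unless_all_caps that leaves all-caps tokens alone. It is only a rough heuristic, not a recommended recipe: it also preserves shouted words like "FINE", and it does nothing for "Apple" vs "apple".

```python
from collections import Counter

# Made-up tokens: three pairs that differ only in case but not in meaning.
tokens = ["US", "us", "WHO", "who", "Apple", "apple"]

# Plain lowercasing merges each pair into a single entry.
folded_counts = Counter(t.lower() for t in tokens)
print(folded_counts)  # each of 'us', 'who', 'apple' is counted twice


def fold_unless_all_caps(token):
    """Hypothetical helper: lowercase a token unless it is all caps."""
    if len(token) > 1 and token.isupper():
        return token          # looks like an acronym (or shouting): keep it
    return token.lower()


selective = [fold_unless_all_caps(t) for t in tokens]
print(selective)  # ['US', 'us', 'WHO', 'who', 'apple', 'apple']
```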
Normalization is information destruction (on purpose)
Every normalization step deletes information so that the remaining tokens compare more easily. That is its entire job — but it means the step can never be undone, and if the information you deleted mattered to your task, you have quietly made things worse. Always ask: what am I throwing away, and does my task need it? For plain word-counting, casing is noise. For entity recognition, casing is signal.
Punctuation stripping
The second normalization workhorse is removing punctuation. After
tokenization you often have standalone punctuation tokens (",", "!",
".") and sometimes punctuation still clinging to words. For most
bag-of-words style tasks, punctuation is noise you want gone. There are a few
common ways to strip it; know more than one.
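Here is a sketch of three common approaches run on the same input. The token list is a made-up example of tokenizer output in which some punctuation stands alone and some still clings to words.

```python
import re
import string

# Hypothetical tokenizer output, chosen to include clinging punctuation
# ("World!") and contractions ("It's", "isn't").
tokens = ["Hello", ",", "World!", "It's", "fine", ",", "isn't", "it", "?"]

# Approach 1: keep only tokens that are entirely alphabetic.
approach_1 = [t.lower() for t in tokens if t.isalpha()]

# Approach 2: delete punctuation characters with translate,
# then drop any tokens that end up empty.
strip_table = str.maketrans("", "", string.punctuation)
approach_2 = [t.translate(strip_table).lower() for t in tokens]
approach_2 = [t for t in approach_2 if t]

# Approach 3: rejoin, lowercase, and extract runs of letters with a regex.
approach_3 = re.findall(r"[a-z]+", " ".join(tokens).lower())

print(approach_1)  # ['hello', 'fine', 'it']
print(approach_2)  # ['hello', 'world', 'its', 'fine', 'isnt', 'it']
print(approach_3)  # ['hello', 'world', 'it', 's', 'fine', 'isn', 't', 'it']
```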
Compare the three results carefully — they disagree, and the disagreements are instructive:
- Approach 1 (isalpha filter) drops "World!" entirely because the whole token "World!" is not alphabetic. That is probably not what you wanted: you lost the word, not just the punctuation.
- Approach 2 (translate) removes punctuation characters, turning "World!" into "world" and "It's" into "its". It keeps the words but glues contractions together ("its").
- Approach 3 (regex [a-z]+) extracts runs of letters, so "isn't" becomes two tokens, "isn" and "t".
There is no single right answer; each approach makes a different trade-off about contractions and partial-punctuation tokens. The lesson is to look at what your stripping actually does rather than trusting it blindly.
The order trap, again
Strip punctuation after tokenizing, not before. If you delete all the periods first, you destroy the very clues a sentence tokenizer needs to find sentence boundaries — and you may even merge two sentences into one run-on. Tokenize, then clean.
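To see the trap in action, the sketch below uses a deliberately naive regex splitter, split_sentences, as a stand-in for a real sentence tokenizer. Both the helper and the sample text are assumptions made for this illustration.

```python
import re
import string


def split_sentences(text):
    """Naive stand-in for a sentence tokenizer: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


text = "It rained. We stayed in."
strip_table = str.maketrans("", "", string.punctuation)

# Right order: find the sentences first, then strip punctuation from each.
tokenize_then_clean = [s.translate(strip_table) for s in split_sentences(text)]

# Wrong order: strip first, and the splitter has no boundary clues left.
clean_then_tokenize = split_sentences(text.translate(strip_table))

print(tokenize_then_clean)  # ['It rained', 'We stayed in']
print(clean_then_tokenize)  # ['It rained We stayed in']
```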
Why is lowercasing described as a lossy operation?
Because it makes the text take up less memory
Because it permanently discards capitalization, which sometimes carries real meaning (e.g., "Apple" vs "apple", "US" vs "us"), and that information cannot be recovered afterward
Because it always introduces spelling errors
Because it converts the text into a different language
What normalization gives and takes
It is worth holding both effects in your head at once.
For a search index or a word-frequency study, the gains dominate and the losses are irrelevant — lowercase away. For named-entity recognition or emphasis-sensitive sentiment, the losses can be fatal — be careful, or skip the step. Same operation, opposite verdict, depending on the task.
Your turn: write a normalizer
Write a function normalize(text) that returns a list of normalized words:
- Lowercase the whole string.
- Remove punctuation characters (the ones in string.punctuation).
- Split on whitespace and drop any empty strings.
For example, normalize("Hello, World!") should return ['hello', 'world']. The string module is already imported for you.
(Heads up: because step 2 removes the apostrophe, a contraction like "it's" will become "its". That is a real side effect of character-level punctuation stripping — you do not need to fix it here, just be aware of it.)
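If you want something to compare your attempt against, here is one possible solution. It is a sketch of the three steps as listed, not the only valid answer; note that split() with no arguments already discards empty strings.

```python
import string


def normalize(text):
    """Lowercase, remove string.punctuation characters, split on whitespace."""
    lowered = text.lower()
    strip_table = str.maketrans("", "", string.punctuation)
    no_punct = lowered.translate(strip_table)
    # split() with no arguments splits on any whitespace and never
    # returns empty strings, so no extra filtering is needed.
    return no_punct.split()


print(normalize("Hello, World!"))  # ['hello', 'world']
print(normalize("It's fine."))     # ['its', 'fine']
```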
Check your understanding
For a system that counts how often each word appears in a news article, why is lowercasing usually the right call?
It makes the article shorter
Different capitalizations of the same word ("The", "the", "THE") are the same word for counting purposes, so folding them together gives accurate totals
It translates the article into another language
It is required before any text can be stored
A team building a system to detect company names in news ("Apple announced…") lowercases all text as the very first step and then complains the system confuses companies with common nouns. What went wrong?
Lowercasing is too slow for news text
Capitalization is a key signal for recognizing names, and lowercasing first destroyed it — for this task, case folding is harmful
The system needed more punctuation, not less
News articles cannot be processed by computers
You strip punctuation by keeping only tokens where token.isalpha() is
True. On the token "World!" what happens, and why might that surprise
you?
"World!" becomes "World" with the "!" removed
The entire token "World!" is dropped, because the whole string is not alphabetic — so you lose the word, not just the punctuation
It raises an error
It splits into "World" and "!"
Why should punctuation stripping generally come after sentence tokenization, not before?
Punctuation cannot be removed from a tokenized list
Sentence tokenizers use periods, question marks, and exclamation points to find sentence boundaries; removing them first destroys those clues and can merge separate sentences
Stripping punctuation changes the language of the text
It does not matter; order is irrelevant here
Case folding and punctuation stripping shrink the number of distinct spellings. But two related words like "running" and "runs" still look unrelated, and ultra-common words like "the" still dominate the counts. We tackle the common words next: stopwords.