Sentence and Word Tokenization
Why breaking text into words and sentences is a real linguistic problem, not a call to .split(). How word_tokenize handles contractions and punctuation, how sentence tokenization survives 'Dr.' and decimals, and when tokenization choices matter downstream.
Tokenization is the very first transformation in almost every pipeline, and it is the one beginners most often underestimate. A token is a single meaningful unit of text — usually a word, but also a number, a punctuation mark, or a symbol. Tokenization is the process of cutting a raw string into that list of tokens. It sounds trivial. It is not.
The reason it is not trivial is everything we saw on the first page: contractions hide words, punctuation glues itself to words, and a period means three different things. A good tokenizer encodes a surprising amount of linguistic knowledge to get the cuts right.
The misconception to unlearn first
"Tokenization is just string.split()." This is the single most common
misunderstanding in beginner NLP. Splitting on whitespace is a crude
approximation that fails the moment punctuation or contractions appear — and
they always appear. Real tokenization makes linguistically informed cuts:
it knows that the comma in "butter," is not part of the word, and that "n't"
in "don't" is a separate, meaningful unit.
Word tokenization: where .split() breaks
Let us put the naive approach and a real tokenizer side by side on one tiny, nasty input.
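A minimal sketch of that comparison, assuming the input is the string "don't stop!". To stay runnable without any downloads, it uses a toy regular expression (the hypothetical toy_tokenize below) in place of NLTK's word_tokenize. The pattern covers only this example and is not NLTK's rule set.

```python
import re

# Toy pattern, NOT NLTK's rules: it only handles the cases in this example.
TOKEN_PATTERN = re.compile(
    r"n't"            # the negation clitic as its own token
    r"|\w+(?=n't)"    # a word stem directly followed by n't ("do" in "don't")
    r"|\w+"           # any other run of word characters
    r"|[^\w\s]"       # any single punctuation mark
)

def toy_tokenize(text):
    """Hypothetical stand-in for a real word tokenizer."""
    return TOKEN_PATTERN.findall(text)

text = "don't stop!"
naive = text.split()
tokens = toy_tokenize(text)

print(naive)   # ["don't", 'stop!']
print(tokens)  # ['do', "n't", 'stop', '!']
```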
split() produced two items, both subtly broken: the negation inside "don't"
is invisible, and "stop!" would never match the word "stop". The tokenizer
produced four clean units, and crucially it pulled "n't" out as its own token
— which matters enormously for tasks like sentiment analysis, where negation
flips meaning.
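To make the negation point concrete, here is a rough sketch of a lexicon-based polarity check. The word lists and the function toy_polarity are invented for illustration, and the second token list is written out by hand in the shape the tokenizer is described as producing.

```python
# Tiny, made-up lexicons purely for illustration.
POSITIVE = {"like", "love", "good"}
NEGATORS = {"n't", "not", "never"}

def toy_polarity(tokens):
    """Add +1 per positive word, or -1 if a negator appeared before it."""
    score = 0
    negate = False
    for tok in tokens:
        word = tok.lower()
        if word in NEGATORS:
            negate = True
        elif word in POSITIVE:
            score += -1 if negate else 1
            negate = False
    return score

whitespace_tokens = "I don't like it".split()       # ["I", "don't", "like", "it"]
split_contraction = ["I", "do", "n't", "like", "it"]  # negation exposed as a token

print(toy_polarity(whitespace_tokens))  # 1
print(toy_polarity(split_contraction))  # -1
```

With whitespace tokens the negation stays buried inside "don't", so the sentence scores as positive. Once "n't" is its own token, the same simple logic gets the sign right.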
See it run for real. The setup section above the editor loads the tokenizer for you.
Study the differences token by token:
- "don't" becomes ['do', "n't"]: the negation is now a separate token.
- "Lee's" becomes ['Lee', "'s"]: the possessive is split off.
- "we'll" becomes ['we', "'ll"]: the future-tense marker is exposed.
- "," and "!" become their own tokens instead of clinging to words.
None of that happens with split(). The tokenizer is applying hand-written
rules that encode how English is actually written.
How does it know? (a peek, not a deep dive)
NLTK's default word tokenizer follows the Penn Treebank conventions:
a fixed set of hand-written regular-expression rules that reproduce how a
large hand-annotated corpus of English was tokenized. The rules say things
like "split a leading or trailing quote", "separate n't, 're, 'll, 's", and
"keep decimal numbers
together". You do not need to memorize the rules. The lesson is that
tokenization is encoded linguistic knowledge, which is exactly why a
one-line split() cannot replace it.
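The idea of "rules applied in order" can be sketched in a few lines. The rule list below is a small hypothetical subset written for this illustration, in the spirit of the conventions just described. It is not NLTK's actual rule set and it will mishandle many real inputs (for example, it would split "1,000").

```python
import re

# Ordered (pattern, replacement) pairs. Each rule inserts spaces around a
# unit that should become its own token; a final split() does the cutting.
RULES = [
    (re.compile(r'"'), r' " '),                                  # split off double quotes
    (re.compile(r"(\w)(n't)\b", re.IGNORECASE), r"\1 \2"),       # don't  -> do n't
    (re.compile(r"(\w)('s|'re|'ll)\b", re.IGNORECASE), r"\1 \2"),  # Lee's -> Lee 's
    (re.compile(r"([,;:!?])"), r" \1 "),                         # separate these marks (toy rule)
    (re.compile(r"\.(?=\s|$)"), " . "),                          # final period only, so 2.5 survives
]

def rule_tokenize(text):
    """Hypothetical rule-based tokenizer: apply each rule, then split on spaces."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.split()

sample = 'Lee\'s fee is 2.5, but we\'ll say "no", don\'t worry!'
print(rule_tokenize(sample))
# ['Lee', "'s", 'fee', 'is', '2.5', ',', 'but', 'we', "'ll", 'say',
#  '"', 'no', '"', ',', 'do', "n't", 'worry', '!']
```

Every rule is a small piece of knowledge about written English, and the order matters: the contraction rules have to run before the generic punctuation rules get a chance to touch the apostrophes.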
After word_tokenize("don't"), the result is ['do', "n't"]. Why is
splitting the contraction this way valuable for something like sentiment
analysis?
It makes the text shorter and therefore faster to process
It exposes the negation "n't" as its own token, so a later step can detect that the sentence is being negated
It converts the word into a number
It removes the word entirely, which is what we want
Sentence tokenization: where splitting on "." breaks
The mirror-image problem appears one level up. To split a paragraph into
sentences, your instinct might be text.split("."). That fails immediately,
because a period is wildly overloaded: it ends sentences, but it also
abbreviates "Dr." and "U.S.A.", and it marks the decimal in "2.5".
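NLTK's sentence tokenizer, Punkt, is smarter. It is a trained model: it carries abbreviations learned from a corpus, plus statistics about which words tend to start sentences, so it does not mistake the period in "Dr." for a sentence boundary. It also leaves the period in "2.5" alone, because no whitespace follows it. Watch it work.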
The naive split shattered "Dr." and "2.5" into nonsense and missed that "?" also ends a sentence. Punkt returned exactly three clean sentences. This is why sentence tokenization is its own step with its own tool, not a string method.
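The same contrast can be sketched without NLTK. The splitter below is a deliberately crude stand-in, not Punkt: its abbreviation set is hand-picked and hypothetical, whereas Punkt learns its inventory from data. The sample text is also an assumption, chosen to contain "Dr.", "2.5", and a question mark.

```python
import re

# Hypothetical, hand-picked abbreviations (Punkt learns its own from a corpus).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof."}

def toy_sent_split(text):
    """Split after . ! or ? when followed by whitespace or end of text,
    unless the word ending there is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Dr." does not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())  # trailing text with no end mark
    return sentences

text = "Dr. Lee earned 2.5 million. Is that a lot? Yes."
naive = text.split(".")
sentences = toy_sent_split(text)

print(naive)      # ['Dr', ' Lee earned 2', '5 million', ' Is that a lot? Yes', '']
print(sentences)  # ['Dr. Lee earned 2.5 million.', 'Is that a lot?', 'Yes.']
```

The decimal survives because the candidate pattern only considers a period that is followed by whitespace or the end of the text.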
The usual order: sentences first, then words
When a task cares about sentence boundaries (summarization, sentence-level
sentiment, or anything that processes one sentence at a time), the common
pattern is to call sent_tokenize first, then word_tokenize on each
sentence. When you only need a flat bag of words (counting, search indexing),
you can skip straight to word_tokenize on the whole text. Choose based on
whether sentence structure matters to what you are doing.
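The two shapes of output look like this in a small sketch. The helper tokenize_by_sentence is hypothetical and takes the two splitters as arguments. With NLTK loaded you would pass sent_tokenize and word_tokenize; the crude regex placeholders here exist only so the example runs on its own.

```python
import re

def tokenize_by_sentence(text, sent_split, word_split):
    """Return one token list per sentence: sentences first, then words."""
    return [word_split(sentence) for sentence in sent_split(text)]

# Crude placeholders standing in for real tokenizers.
def simple_sents(text):
    return re.split(r"(?<=[.!?])\s+", text.strip())

def simple_words(text):
    return re.findall(r"\w+|[^\w\s]", text)

text = "I liked it. Did you?"

nested = tokenize_by_sentence(text, simple_sents, simple_words)
flat = simple_words(text)

print(nested)  # [['I', 'liked', 'it', '.'], ['Did', 'you', '?']]
print(flat)    # ['I', 'liked', 'it', '.', 'Did', 'you', '?']
```

The nested form keeps sentence boundaries available to later steps. The flat form discards them, which is fine for counting or indexing.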
When the "right" tokens depend on your domain
There is rarely one universally correct tokenization — it depends on what the
text is. Consider social media: in the tweet "Loving #NLP @ the
conference 😀", do you want "#NLP" kept whole as a hashtag, or split into "#"
and "NLP"? Do you want the emoji preserved as a token? The default
word_tokenize will not treat hashtags or emoji specially; NLTK ships a
TweetTokenizer that does.
Notice how the default tokenizer breaks "#NLP" into "#" and "NLP" and may
split the emoticon, while TweetTokenizer keeps the hashtag, handle, and
:) intact. Neither is "wrong" — they serve different goals. The takeaway:
the correct tokenization is the one that preserves the units your task cares
about. This is your first encounter with the recurring theme that pipeline
choices are task-dependent.
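A rough sketch of what "domain-aware" means in practice, using only a regular expression. The pattern is a toy written for this illustration and is not how TweetTokenizer is implemented. The tweet extends the one above with a made-up handle and an emoticon so that all three cases appear.

```python
import re

# Generic pattern: words, or any single non-space character.
GENERIC = re.compile(r"\w+|\S")

# Tweet-aware toy pattern. Earlier alternatives win, so order matters.
TWEET = re.compile(
    r"[#@]\w+"          # hashtags and @handles kept whole
    r"|[:;]-?[)(DP]"    # a few ASCII emoticons such as :) ;-) :D
    r"|\w+"             # ordinary words
    r"|\S"              # anything else (punctuation, emoji), one character at a time
)

tweet = "Loving #NLP @ the conference 😀 thanks @some_user :)"  # handle is invented

generic_tokens = GENERIC.findall(tweet)
tweet_tokens = TWEET.findall(tweet)

print(generic_tokens)
# ['Loving', '#', 'NLP', '@', 'the', 'conference', '😀', 'thanks', '@', 'some_user', ':', ')']
print(tweet_tokens)
# ['Loving', '#NLP', '@', 'the', 'conference', '😀', 'thanks', '@some_user', ':)']
```

The only difference between the two patterns is two extra alternatives placed first, which is enough to change which units survive.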
Why does text.split(".") fail as a sentence splitter on the text
"Dr. Lee earned 2.5 million."?
Periods are invisible characters that split cannot see
The period is overloaded — it appears in abbreviations ("Dr.") and decimals ("2.5") as well as at sentence ends — so splitting on every period cuts in the wrong places
split can only divide a string into two pieces
The sentence has no periods in it
Where tokenization shows up in the real world
- Search engines tokenize your query and every indexed document so that the words can be matched. If "running" and "running," were tokenized differently, your search would silently miss results. Consistent tokenization on both sides is what makes search work at all.
- Spam filters and classifiers count tokens; bad tokenization means counting "free!" and "free" as different words, weakening the signal.
- Machine translation and assistants tokenize input before doing anything else; the quality ceiling of the whole system is partly set here.
- Code and log analysis needs custom tokenizers, because the "words" of a log line or a programming language are not English words.
In all of these, tokenization is invisible when it works and catastrophic when it does not — a classic piece of infrastructure.
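The counting problem from the spam-filter example is easy to reproduce. This sketch compares whitespace splitting with a simple regex tokenizer (a stand-in, not NLTK) on an invented sentence.

```python
import re
from collections import Counter

text = "Get it free! It is free, really free."

# Whitespace splitting leaves punctuation glued to the word.
split_counts = Counter(text.split())

# A simple stand-in tokenizer: words, or single punctuation marks.
token_counts = Counter(re.findall(r"\w+|[^\w\s]", text))

print(split_counts["free"])   # 0
print(split_counts["free!"])  # 1
print(token_counts["free"])   # 3
```

With split() the word "free" is never counted at all, because each occurrence is a different string. Separating the punctuation restores the signal.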
Your turn: tokenize a tricky paragraph
A short paragraph is provided in the variable text. Using the tokenizers
loaded for you:
- Create sentences by sentence-tokenizing text with sent_tokenize.
- Create tokens by word-tokenizing text with word_tokenize.
The paragraph contains the abbreviation "Dr.", the decimal "98.6", and the
contraction "didn't". A correct sentence tokenizer must not break on the
period inside "Dr." or "98.6", and a correct word tokenizer must split
"didn't" so that the negation token "n't" appears in tokens.
Check your understanding
What is a token, in the NLP sense?
Always exactly one English word
A single character of text
A single meaningful unit of text — typically a word, but also a number, punctuation mark, or symbol — produced by tokenization
The numeric ID a database assigns to a row
Which task most clearly calls for sentence tokenization (not just word tokenization)?
Counting how many times the word "data" appears in a document
Building a search index of all the words in a corpus
Summarizing a document by selecting its most important sentences
Lowercasing every word in a document
A teammate tokenizes a set of tweets with the default word_tokenize and is
surprised that "#MachineLearning" is split into "#" and "MachineLearning",
losing the hashtag. What is the best fix?
Give up on tokenizing tweets; it is impossible
Use a tokenizer suited to the domain, such as NLTK's TweetTokenizer, which keeps hashtags, mentions, and emoticons intact
Manually delete every "#" before tokenizing
Switch from Python to a different programming language
Why is "tokenization is just string.split()" a harmful misconception?
split() is actually slower than tokenization
split() only cuts on whitespace, so it leaves punctuation attached to words and never separates contractions — corrupting every count, match, and comparison built on the tokens
split() changes the language of the text
There is no difference; they are the same thing
You now have clean tokens. But "The", "the", and "THE" are still three different strings to a computer, and "running" still looks unrelated to "runs". The next two pages fix that, starting with normalization.