Sentence and Word Tokenization

Why breaking text into words and sentences is a real linguistic problem, not a call to .split(). How word_tokenize handles contractions and punctuation, how sentence tokenization survives 'Dr.' and decimals, and when tokenization choices matter downstream.

Tokenization is the very first transformation in almost every NLP pipeline, and it is the one beginners most often underestimate. A token is a single meaningful unit of text: usually a word, but also a number, a punctuation mark, or a symbol. Tokenization is the process of cutting a raw string into a list of such tokens. It sounds trivial. It is not.

The reason it is not trivial is everything we saw on the first page: contractions hide words, punctuation glues itself to words, and a period means three different things. A good tokenizer encodes a surprising amount of linguistic knowledge to get the cuts right.

The misconception to unlearn first

"Tokenization is just string.split()." This is the single most common misunderstanding in beginner NLP. Splitting on whitespace is a crude approximation that fails the moment punctuation or contractions appear — and they always appear. Real tokenization makes linguistically informed cuts: it knows that the comma in "butter," is not part of the word, and that "n't" in "don't" is a separate, meaningful unit.

Word tokenization: where .split() breaks

Let us put the naive approach and a real tokenizer side by side on one tiny, nasty input.

split() produced two items, both subtly broken: the negation inside "don't" is invisible, and "stop!" would never match the word "stop". The tokenizer produced four clean units, and crucially it pulled "n't" out as its own token — which matters enormously for tasks like sentiment analysis, where negation flips meaning.

See it run for real. The setup section above the editor loads the tokenizer for you.

Study the differences token by token:

  • "don't" becomes ['do', "n't"] — the negation is now a separate token.
  • "Lee's" becomes ['Lee', "'s"] — the possessive is split off.
  • "we'll" becomes ['we', "'ll"] — the future-tense marker is exposed.
  • "," and "!" become their own tokens instead of clinging to words.

None of that happens with split(). The tokenizer is applying hand-written rules that encode how English is actually written.

How does it know? (a peek, not a deep dive)

NLTK's default word tokenizer follows the Penn Treebank conventions: a fixed set of regular-expression rules that reproduce the tokenization used in the Penn Treebank, a large hand-annotated corpus of English. The rules say things like "split a leading or trailing quote", "separate n't, 're, 'll, 's", and "keep decimal numbers together". You do not need to memorize the rules. The lesson is that tokenization is encoded linguistic knowledge, which is exactly why a one-line split() cannot replace it.
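To give a feel for what "rules as regular expressions" means, here is a toy sketch. It is not NLTK's actual rule set: the pattern, the name TOY_TOKEN, and the function toy_tokenize are invented for illustration and cover only a few of the cases mentioned above.

```python
import re

# Toy rules only, loosely inspired by Penn Treebank conventions.
# Alternatives are tried left to right, so order matters.
TOY_TOKEN = re.compile(
    r"""
      \d+(?:\.\d+)?            # numbers, keeping decimals such as 2.5 together
    | \w+(?=n't)               # the stem before n't, e.g. "did" in "didn't"
    | n't                      # the negation itself
    | '(?:s|re|ll|ve|d|m)\b    # other clitics: 's, 're, 'll, 've, 'd, 'm
    | \w+                      # ordinary words
    | [^\w\s]                  # any other single non-space character
    """,
    re.VERBOSE | re.IGNORECASE,
)


def toy_tokenize(text):
    """Return the tokens matched by the toy rules, in order."""
    return TOY_TOKEN.findall(text)


toy_tokens = toy_tokenize("She didn't pay 2.5 million, but we'll see.")
print(toy_tokens)
# ['She', 'did', "n't", 'pay', '2.5', 'million', ',', 'but', 'we', "'ll", 'see', '.']
```

Even this handful of rules is already more than split() can express, and it still mishandles plenty of cases (abbreviations such as "Dr.", for one), which is why real tokenizers carry many more rules.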

Question (select one)

After word_tokenize("don't"), the result is ['do', "n't"]. Why is splitting the contraction this way valuable for something like sentiment analysis?

  • It makes the text shorter and therefore faster to process
  • It exposes the negation "n't" as its own token, so a later step can detect that the sentence is being negated
  • It converts the word into a number
  • It removes the word entirely, which is what we want

Sentence tokenization: where splitting on "." breaks

The mirror-image problem appears one level up. To split a paragraph into sentences, your instinct might be text.split("."). That fails immediately, because a period is wildly overloaded: it ends sentences, but it also marks abbreviations such as "Dr." and "U.S.A.", and it serves as the decimal point in "2.5".
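The failure is easy to reproduce with plain Python. The paragraph below is a made-up example chosen to contain an abbreviation, a decimal, and a question mark.

```python
text = "Dr. Lee earned 2.5 million last year. Is that a lot? It depends."

# Cut on every period, regardless of what the period means.
pieces = text.split(".")

for piece in pieces:
    print(repr(piece))
# 'Dr'
# ' Lee earned 2'
# '5 million last year'
# ' Is that a lot? It depends'
# ''
```

"Dr." and "2.5" are cut in half, the question and the sentence after it stay glued together, and a stray empty string appears at the end.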

NLTK's sentence tokenizer, Punkt, is smarter. Its pre-trained English model carries a list of known abbreviations, along with learned cues about which words tend to start sentences, so it usually does not mistake the period in "Dr." or "2.5" for a sentence boundary. Watch it work.

The naive split shattered "Dr." and "2.5" into nonsense and missed that "?" also ends a sentence. Punkt returned exactly three clean sentences. This is why sentence tokenization is its own step with its own tool, not a string method.

The usual order: sentences first, then words

When a task cares about sentence boundaries (summarization, sentence-level sentiment, or anything that processes one sentence at a time), the common pattern is to call sent_tokenize first, then word_tokenize on each sentence. When you only need a flat bag of words (counting, search indexing), you can skip straight to word_tokenize on the whole text. Choose based on whether sentence structure matters to what you are doing.

When the "right" tokens depend on your domain

There is rarely one universally correct tokenization — it depends on what the text is. Consider social media: in the tweet "Loving #NLP @ the conference 😀", do you want "#NLP" kept whole as a hashtag, or split into "#" and "NLP"? Do you want the emoji preserved as a token? The default word_tokenize will not treat hashtags or emoji specially; NLTK ships a TweetTokenizer that does.

Notice how the default tokenizer breaks "#NLP" into "#" and "NLP" and may split an emoticon such as ":)", while TweetTokenizer keeps hashtags, handles, and emoticons intact. Neither is "wrong"; they serve different goals. The takeaway: the correct tokenization is the one that preserves the units your task cares about. This is your first encounter with the recurring theme that pipeline choices are task-dependent.

Question (select one)

Why does text.split(".") fail as a sentence splitter on the text "Dr. Lee earned 2.5 million."?

  • Periods are invisible characters that split cannot see
  • The period is overloaded: it appears in abbreviations ("Dr.") and decimals ("2.5") as well as at sentence ends, so splitting on every period cuts in the wrong places
  • split can only divide a string into two pieces
  • The sentence has no periods in it

Where tokenization shows up in the real world

  • Search engines tokenize your query and every indexed document so that the words can be matched. If "running" and "running," were tokenized differently, your search would silently miss results. Consistent tokenization on both sides is what makes search work at all.
  • Spam filters and classifiers count tokens; bad tokenization means counting "free!" and "free" as different words, weakening the signal.
  • Machine translation and assistants tokenize input before doing anything else; the quality ceiling of the whole system is partly set here.
  • Code and log analysis needs custom tokenizers, because the "words" of a log line or a programming language are not English words.

In all of these, tokenization is invisible when it works and catastrophic when it does not — a classic piece of infrastructure.
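The spam-filter point can be made concrete without any NLP library. The sketch below uses made-up messages and a deliberately simple regular expression as a stand-in tokenizer (an assumption for illustration, not what NLTK does internally).

```python
import re
from collections import Counter

messages = [  # made-up examples
    "Get it free!",
    "Totally free, no catch",
    "free shipping today",
]


def simple_tokens(text):
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


naive_counts = Counter(tok for msg in messages for tok in msg.split())
clean_counts = Counter(tok for msg in messages for tok in simple_tokens(msg))

print(naive_counts["free"])  # 1, because "free!" and "free," are counted separately
print(clean_counts["free"])  # 3
```

With whitespace splitting, the evidence for "free" is spread across three different strings; once punctuation is separated, all three occurrences land on the same token.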

Your turn: tokenize a tricky paragraph

Challenge
Count sentences and expose a contraction

A short paragraph is provided in the variable text. Using the tokenizers loaded for you:

  1. Create sentences by sentence-tokenizing text with sent_tokenize.
  2. Create tokens by word-tokenizing text with word_tokenize.

The paragraph contains the abbreviation "Dr.", the decimal "98.6", and the contraction "didn't". A correct sentence tokenizer must not break on the period inside "Dr." or "98.6", and a correct word tokenizer must split "didn't" so that the negation token "n't" appears in tokens.

Check your understanding

Question (select one)

What is a token, in the NLP sense?

  • Always exactly one English word
  • A single character of text
  • A single meaningful unit of text (typically a word, but also a number, punctuation mark, or symbol) produced by tokenization
  • The numeric ID a database assigns to a row

Question (select one)

Which task most clearly calls for sentence tokenization (not just word tokenization)?

  • Counting how many times the word "data" appears in a document
  • Building a search index of all the words in a corpus
  • Summarizing a document by selecting its most important sentences
  • Lowercasing every word in a document

Question (select one)

A teammate tokenizes a set of tweets with the default word_tokenize and is surprised that "#MachineLearning" is split into "#" and "MachineLearning", losing the hashtag. What is the best fix?

  • Give up on tokenizing tweets; it is impossible
  • Use a tokenizer suited to the domain, such as NLTK's TweetTokenizer, which keeps hashtags, mentions, and emoticons intact
  • Manually delete every "#" before tokenizing
  • Switch from Python to a different programming language

Question (select one)

Why is "tokenization is just string.split()" a harmful misconception?

  • split() is actually slower than tokenization
  • split() only cuts on whitespace, so it leaves punctuation attached to words and never separates contractions, which corrupts every count, match, and comparison built on the tokens
  • split() changes the language of the text
  • There is no difference; they are the same thing

You now have clean tokens. But "The", "the", and "THE" are still three different strings to a computer, and "running" still looks unrelated to "runs". The next two pages fix that, starting with normalization.
