What Is Natural Language Processing?

Imagine you hand a computer this sentence and ask, innocently, "how many words is that?"

"I can't believe Dr. Smith's startup raised $2.5M — it's unreal!"

You already sense the trouble. Is "can't" one word or two ("can" + "not")? Does the period after "Dr" end a sentence? Is "$2.5M" a word? Is the em dash a word? A human reads this in a second and understands it perfectly. A computer sees a flat string of characters with no idea where words begin, which marks matter, or what any of it means. Natural Language Processing (NLP) is the field that closes that gap.
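
To make the question concrete, here is a small sketch that counts the "words" in that sentence under two different, equally defensible rules. The variable names and the two rules are illustrative choices, not definitions used elsewhere in this course.

```python
import re

# The example sentence from above; \u2014 is the em dash.
sentence = "I can't believe Dr. Smith's startup raised $2.5M \u2014 it's unreal!"

# Rule 1: a word is anything between runs of whitespace.
by_whitespace = sentence.split()

# Rule 2: a word is any run of letters, digits, or underscores.
by_word_chars = re.findall(r"\w+", sentence)

print(len(by_whitespace), by_whitespace)
print(len(by_word_chars), by_word_chars)
```

The two rules disagree: the first yields 11 items (including the dash on its own), the second yields 14 (with "can't" broken into "can" and "t"). Neither answer is obviously the right one, which is exactly the trouble.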

The core problem: text is unstructured

Most data a program handles is structured: a number is a number, a date has a year-month-day, a database column has a known type. Text is unstructured. It is just a sequence of characters, and all of its meaning — the words, the sentences, the grammar, the intent — is implied rather than labeled. Before a program can do anything useful with text, that implicit structure has to be made explicit.

That is the one-sentence mission of this entire course: turn unstructured text into structure a program can compute with. Everything else — tokenization, normalization, tagging — is a specific technique for adding a specific kind of structure.

A working definition

Natural Language Processing is the set of techniques for getting computers to work with human language — to break it into pieces, measure it, compare it, and extract meaning from it. "Natural" language (English, Spanish, Japanese) is contrasted with the "formal" languages computers were designed for, like Python or SQL, which are unambiguous by construction.

Why this is hard: ambiguity everywhere

Formal languages are designed so that every statement has exactly one meaning. Human language is the opposite: it is riddled with ambiguity, and humans resolve it so effortlessly with context that we rarely notice it is there. A computer has no such instincts. It is worth seeing the main kinds of ambiguity explicitly, because nearly every NLP technique exists to fight one of them.

Lexical ambiguity is when a single word has multiple meanings. "Lead" can be a heavy metal or the act of guiding. "Bank" can be a riverside or a place for money. The word is identical; only context decides.

Structural (syntactic) ambiguity is when the same words can be grouped into different grammatical structures. "I saw the man with the telescope" — did you use a telescope to see him, or did you see a man who was holding one? Both readings use the exact same words in the exact same order.

Referential ambiguity is when it is unclear what a word points to. "The trophy didn't fit in the case because it was too big." What was too big — the trophy or the case? A human knows instantly (the trophy). A computer has no built-in notion of which object would plausibly be "too big."

The mental shift

Stop thinking of text as words and start thinking of it as characters with hidden structure you must reconstruct. The hidden structure is where the words are, where the sentences end, which word is the verb, and what refers to what. NLP is the work of reconstructing that structure, one layer at a time.

Text is also irregular and noisy

Beyond ambiguity, real text is just messy in ways that break naive programs:

Contractions hide words: "don't" contains "do" + "not"; "we'll" contains "we" + "will".
Punctuation is overloaded: a period can end a sentence, mark an abbreviation such as "Dr.", or act as the decimal point in "2.5".
Casing is inconsistent: "Apple" the company vs. "apple" the fruit, or a word capitalized only because it starts a sentence.
Morphology multiplies forms: "run", "runs", "running", and "ran" are four inflected forms of one underlying word.
Multi-word units: "New York", "machine learning", and "hot dog" are each one concept written as several words.

Every one of these is a reason a later page exists. Contractions and punctuation motivate tokenization. Casing motivates case folding. Morphology motivates stemming and lemmatization. Keep this list in mind — we are going to dismantle it problem by problem.
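
Several of the problems listed above show up the moment you try to count words naively. The following is a minimal sketch; the text is a hypothetical snippet written only to exhibit that noise, and the counting rule (whitespace splitting) is deliberately crude.

```python
from collections import Counter

# Hypothetical snippet, invented to exhibit casing, morphology, and period noise.
text = "Apple sells apples. I run, she runs, we ran. Dr. Lee paid 2.5 dollars."

counts = Counter(text.split())

# Casing: "Apple" is a key, lowercase "apple" is not (a Counter returns 0 for it).
print(counts["Apple"], counts["apple"])

# Morphology: three forms of one verb end up as three unrelated keys.
run_forms = [w for w in counts if w.startswith("r")]
print(run_forms)

# Overloaded period: sentence end, abbreviation, and decimal point all look alike.
with_period = [w for w in counts if "." in w]
print(with_period)
```

Nothing in the counter connects "run," to "runs," or tells the period in "Dr." apart from the one in "2.5". Each of those links has to be added by a later technique.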

A first look: why `.split()` is not enough

Your instinct, as a Python programmer, is probably to reach for text.split() to break a sentence into words. It is a reasonable first guess, and seeing exactly where it fails is the perfect motivation for what comes next. Run this and read the output carefully.
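
A minimal sketch of that experiment follows. The sentence is an assumed stand-in, chosen to contain a contraction, a comma, a title with a period, and a final exclamation mark.

```python
# Assumed stand-in sentence (not taken from the course materials).
sentence = "I can't find the butter, Mr. Jones!"

# str.split() with no arguments splits on runs of whitespace and nothing else.
words = sentence.split()
print(words)
# ['I', "can't", 'find', 'the', 'butter,', 'Mr.', 'Jones!']
```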

Look at what .split() produced. "can't" stayed glued together, so the hidden "not" is invisible. "butter," carries a comma stuck to it, so the word "butter" and the word "butter," would be counted as two different words. "Jones!" has an exclamation mark fused on. The naive split has no idea that punctuation is separate from words.

Now watch a real tokenizer handle the same sentence.

The tokenizer split "can't" into "ca" + "n't", exposing the negation. It peeled the comma off "butter" and the exclamation mark off "Jones". It even left "Mr." intact rather than splitting its period off as a separate token. We will study exactly how it does this on the tokenization page. For now, the point is simply that breaking text into words is a real problem, not a one-liner.
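
The kind of output described above can be imitated in a few lines of standard-library Python. This is a deliberately tiny sketch and not the tokenizer the course uses: the function name simple_tokenize, the three-entry abbreviation set, and the stand-in sentence are all assumptions made for illustration.

```python
# Tiny illustrative abbreviation list; a real tokenizer knows far more.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}
TRAILING_PUNCTUATION = ",.!?;:"

def simple_tokenize(text):
    """Split on whitespace, then peel punctuation and n't off each chunk."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel punctuation off the end, unless the chunk is a known abbreviation.
        while chunk and chunk not in ABBREVIATIONS and chunk[-1] in TRAILING_PUNCTUATION:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Separate the negation from contractions such as "can't".
        if chunk.lower().endswith("n't") and len(chunk) > 3:
            tokens.extend([chunk[:-3], chunk[-3:]])
        elif chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens

# Assumed stand-in sentence (not taken from the course materials).
print(simple_tokenize("I can't find the butter, Mr. Jones!"))
# ['I', 'ca', "n't", 'find', 'the', 'butter', ',', 'Mr.', 'Jones', '!']
```

Even this sketch needs a special case for contractions and a lookup list for titles, and it still ignores quotes, hyphens, numbers, and every contraction other than n't. That growing pile of special cases is why tokenization gets a page of its own.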

Question (select one)

Why does sentence.split() give a misleading word list for the sentence above?

It removes all the vowels from each word

It only splits on whitespace, so punctuation stays attached to words and contractions are never separated

It splits every character into its own item

It automatically lowercases the text, changing the words

So what does an NLP system actually do?

It almost never tackles "understand this text" in one giant leap. Instead it adds structure in layers, each layer building on the last. A typical analysis pipeline looks like this:

Each arrow is a technique you will learn. By the end you will be able to walk a paragraph all the way from the left of that diagram to the right, and — just as importantly — explain why each arrow is there and when you would skip it.
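
The layering idea can be sketched as a chain of small functions, each consuming the previous one's output. The stages chosen here (tokenize, case-fold, drop stopwords, count) and the five-word stopword list are assumptions for illustration; the course's own pipeline may contain different or additional steps.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "is", "on", "and"}

def tokenize(text):
    # Layer 1: characters -> tokens (runs of letters and apostrophes, a crude rule).
    return re.findall(r"[A-Za-z']+", text)

def normalize(tokens):
    # Layer 2: tokens -> case-folded tokens.
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    # Layer 3: drop very common words that carry little content.
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text):
    # Each layer builds on the structure added by the one before it.
    return Counter(remove_stopwords(normalize(tokenize(text))))

print(analyze("The cat sat on the mat. The mat is red."))
# Counter({'mat': 2, 'cat': 1, 'sat': 1, 'red': 1})
```

Because each layer is a separate function, skipping a step is a one-line change, which makes it easy to test whether a given step actually helps your task.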

A common misconception: NLP means AI that 'understands'

It is tempting to imagine NLP as a system that reads text and understands it the way you do. The classic techniques in this course do nothing of the kind. They count, match, and transform — mechanically and without comprehension. A frequency counter does not know what a word means; it knows how often it appears. This is not a limitation to apologize for: an enormous amount of useful work (search, spam filtering, topic detection) is done by systems that never "understand" a thing. Knowing the difference between processing text and understanding it will keep your expectations honest.
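
A frequency counter makes the point in a few lines. The review text below is hypothetical, invented so that counting and meaning visibly come apart.

```python
from collections import Counter

# Hypothetical review text, invented for illustration.
review = "good phone good battery not good camera"

freq = Counter(review.split())
print(freq.most_common(1))
# [('good', 3)]
```

The counter reports "good" as the dominant word and has no idea that one of those three occurrences is negated by "not". It processed the text correctly and understood none of it.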

Rules versus learning (a brief orientation)

There are two broad strategies for handling language, and this course leans firmly toward the first.

Rule-based / algorithmic approaches encode human knowledge directly: a list of stopwords, a set of suffix-stripping rules for stemming, a dictionary of positive and negative words for sentiment. They are transparent — you can read exactly why they did what they did — and they need no training data.
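
As a sketch of the rule-based style, here is a dictionary-driven sentiment scorer. The two word lists and the function name sentiment_score are illustrative assumptions, not a published lexicon or a library API.

```python
# Tiny hand-written lexicons (illustrative assumptions, not a published word list).
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "slow", "broken"}

def sentiment_score(text):
    # Crude normalization: lowercase and strip surrounding punctuation.
    words = [w.strip(".,!?").lower() for w in text.split()]
    matched_pos = [w for w in words if w in POSITIVE]
    matched_neg = [w for w in words if w in NEGATIVE]
    # The matched words are the complete explanation for the score.
    return len(matched_pos) - len(matched_neg), matched_pos, matched_neg

print(sentiment_score("Great screen, but the app is slow and the charger is broken."))
# (-1, ['great'], ['slow', 'broken'])
```

The transparency is visible in the return value: the score comes with the exact words that produced it, and no training data was involved.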

Learning-based approaches (machine learning, and the deep-learning models behind modern assistants) infer their behavior from large quantities of example text instead of hand-written rules. They are powerful but opaque, and using them well assumes you already understand the structure of text.

We focus on classic, rule-based and count-based NLP because it is where the durable intuition lives. Once you understand why text needs tokenizing, normalizing, and tagging, the modern learning-based tools become far easier to pick up, because they still rely on several of these ideas under the hood, tokenization in particular.

Where NLP shows up in the real world

The techniques in this course are not academic curiosities. They quietly power tools you use every day:

Search engines tokenize and normalize both your query and billions of documents so that searching "running" can also find "runs" and "ran".
Spam filters count suspicious words and phrases to score an email.
Autocomplete and predictive text lean on n-gram models, which estimate the probability of the next word given the previous one or two.
Voice assistants transcribe messy spoken input into text and then tokenize it before doing anything else.
Sentiment dashboards scan product reviews or social posts for positive and negative language.
Plagiarism and duplicate detection compare normalized word sets and n-grams between documents.
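
The predictive-text item above can be sketched with bigram counts: tally which word follows which, then turn the tallies into relative frequencies. The training text and the function name next_word_probabilities are hypothetical, and a real system would use far more data plus smoothing for unseen words.

```python
from collections import Counter, defaultdict

# Hypothetical training text, invented for illustration and already tokenized by spaces.
corpus = "i like tea . i like coffee . i drink tea . you like tea ."
tokens = corpus.split()

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    following[prev][nxt] += 1

def next_word_probabilities(prev):
    # Relative-frequency estimate of P(next word | previous word).
    total = sum(following[prev].values())
    return {word: count / total for word, count in following[prev].items()}

print(next_word_probabilities("like"))
```

In this toy corpus "like" is followed by "tea" twice and "coffee" once, so the sketch would suggest "tea" with probability 2/3. A word never seen before yields an empty dictionary, which is the gap smoothing is meant to fill.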

In every case, the first thing that happens — before any cleverness — is the unglamorous work of turning raw text into clean, comparable units. That is what you are about to learn to do well.

Check your understanding

Question (select one)

A program reads the word "bank" and cannot decide whether it means a financial institution or the side of a river. This is an example of which kind of ambiguity?

Lexical ambiguity

Structural ambiguity

Referential ambiguity

There is no ambiguity here

Question (select one)

Which statement best captures what classic NLP techniques (like the ones in this course) actually do?

They make the computer genuinely understand the meaning of text the way a human does

They mechanically transform, match, and count text to expose structure — without any real comprehension

They translate all text into mathematical equations that are always exactly correct

They only work on programming languages, not human languages

Question (select one)

"I saw the man with the telescope" can mean either I used a telescope to see him or I saw a man who had a telescope. What kind of ambiguity is this?

Lexical ambiguity

Structural (syntactic) ambiguity

Referential ambiguity

Morphological ambiguity

Question (select one)

Why is "turn unstructured text into structure" a fair one-line summary of NLP's core problem?

Because text files are larger than other kinds of data

Because raw text is just a sequence of characters with all its meaning implied, and programs need that implied structure (words, sentences, tags) made explicit before they can compute with it

Because text always needs to be translated into another language first

Because computers cannot store text at all

Question (select one)

A teammate proposes counting the words in customer reviews using review.split() and tallying the results. What is the most important risk?

Splitting is far too slow to run on text

Punctuation stays attached to words and contractions stay glued, so "great!", "great", and "great." are counted as three different words and counts are distorted

split() deletes the reviews from memory

split() only works on numbers

In the next section we will zoom out and look at the whole pipeline as a single picture — the standard sequence of steps that text flows through — before we dive into each step in detail.
