Missing Values (NA)
Real data is full of holes. R has a first-class concept — `NA` — for representing "I don't know," and a small set of rules for working with it correctly.
In real datasets, things go missing. A survey respondent skips a
question. A sensor fails for a day. A field on a form is left
blank. R has a special value to represent this: NA, which
stands for Not Available.
NA is not the same as zero, and it is not an empty
string. It is R's way of saying: we don't have a value here.
How NA propagates
The single most important rule about NA: most computations
involving NA return NA. R is being careful — if you don't
know one of the inputs, you don't know the answer either.
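Here is a small sketch of the propagation rule in action. The vector x is a made-up example, not data from this page. The last two lines show why the rule says "most" rather than "all": a few logical operations have an answer that does not depend on the unknown value.

```r
# Arithmetic or comparison with an unknown input gives an unknown result.
NA + 1        # NA
NA * 0        # NA
NA > 5        # NA
NA == NA      # NA, not TRUE

# Propagation is element-wise: only positions with a missing input are affected.
x <- c(1, NA, 3)   # hypothetical values
x * 2              # 2 NA 6

# Exceptions: the result is known no matter what the missing value is.
NA & FALSE    # FALSE
NA | TRUE     # TRUE
```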
Notice that even NA == NA returns NA. That trips up almost
everyone the first time. Think of it this way: if I don't know
either value, I cannot tell you whether they're equal.
This is why R has a dedicated function for "is this value missing?":
is.na() is the way to test for missing values. Never use x == NA: the comparison returns NA for every element, with no error or warning.
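To see the difference side by side, here is a minimal sketch on a made-up vector x. The helpers which() and anyNA() are base R functions that pair naturally with is.na().

```r
x <- c(4, NA, 7, NA)   # hypothetical values

x == NA           # NA NA NA NA: tells you nothing
is.na(x)          # FALSE TRUE FALSE TRUE
which(is.na(x))   # 2 4: the positions of the missing values
anyNA(x)          # TRUE: is there at least one NA?
```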
Summary functions and the na.rm argument
Most summary functions (mean, sum, sd, min, max,
median) accept an na.rm argument. It defaults to FALSE, so a
single NA in the input makes the result NA. Setting
na.rm = TRUE quietly removes NAs before computing.
This is deliberately not the default. R's designers wanted you to consciously choose to ignore missingness, rather than have it silently swept under the rug. It is one of the language's most important small design decisions.
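A quick sketch of na.rm with a few summary functions. The scores vector is hypothetical, chosen so the results are easy to check by hand.

```r
scores <- c(10, NA, 20, 30)   # hypothetical values

mean(scores)                  # NA: one unknown input, unknown answer
mean(scores, na.rm = TRUE)    # 20: computed from 10, 20, 30
sum(scores, na.rm = TRUE)     # 60
max(scores, na.rm = TRUE)     # 30
```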
Removing or replacing NAs
Sometimes you want to remove NA values entirely from a vector
before doing anything else:
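One common way to do this is to index with the negation of is.na(). This is a minimal sketch on a made-up vector x.

```r
x <- c(5, NA, 8, NA, 2)    # hypothetical values

# Keep only the positions where x is NOT missing.
clean <- x[!is.na(x)]
clean                      # 5 8 2

# How many values were dropped?
length(x) - length(clean)  # 2
```

Base R also has na.omit(x), which drops the same values but attaches extra bookkeeping attributes to the result.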
Other times, you want to replace NAs with some sensible value (zero, the mean, the last known value):
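Here is a sketch of two replacement patterns on a made-up vector x: first filling with zero, then filling with the mean of the observed values. Each one works on a copy (x_zero and x_mean, both illustrative names) so the original stays intact. Filling with the last known value is not shown here.

```r
x <- c(5, NA, 8, NA, 2)    # hypothetical values

# Pattern 1: replace NAs with zero.
x_zero <- x
x_zero[is.na(x_zero)] <- 0
x_zero                     # 5 0 8 0 2

# Pattern 2: replace NAs with the mean of the observed values.
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)
x_mean                     # 5 5 8 5 2

# The filled-in vector looks less variable than the observed data.
sd(x, na.rm = TRUE)        # 3
sd(x_mean)                 # about 2.12
```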
Be careful with mean imputation (replacing each NA with the mean of the observed values). It is a real statistical decision, not just a cleaning step. Imputing the mean understates the variability in your data and can bias downstream analyses. Use it knowingly, not by default.
Missingness has meaning
Here is the most important idea on this entire page, and probably the most under-appreciated idea in beginner data work: the fact that a value is missing is itself information.
If 30% of survey respondents skipped the "income" question, that is not noise — that's a signal. Maybe people in certain income brackets are more likely to skip. Maybe the question was poorly worded. Maybe there's a UI bug. Either way, you cannot find out unless you look at the missingness before deciding what to do.
For now, the practical takeaway is: always check for NAs
before computing summaries, and always make a conscious choice
about how to handle them.
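As a sketch of what "checking" can look like, the code below counts the NAs, computes the share that is missing, and compares that share across groups. The income and age_group vectors are invented for illustration. They are not real survey data.

```r
# Hypothetical survey answers: income, plus a grouping variable.
income    <- c(52000, NA, 61000, NA, NA, 48000, 75000, NA, 58000, 66000)
age_group <- c("young", "young", "old", "old", "old",
               "young", "young", "old", "young", "old")

anyNA(income)          # TRUE: there is something to look at
sum(is.na(income))     # 4 missing values
mean(is.na(income))    # 0.4, i.e. 40% missing

# Is missingness spread evenly, or concentrated in one group?
miss_rate <- tapply(is.na(income), age_group, mean)
miss_rate              # old: 0.6, young: 0.2
```

A gap like the one between the two groups here is the kind of pattern worth investigating before you remove or replace anything.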
Close cousins: NaN, Inf, and NULL
R has a few other "special" values that beginners sometimes
confuse with NA:
NaN = "Not a Number." Comes from operations like 0/0. It is technically of numeric type. is.na(NaN) returns TRUE.
Inf and -Inf = positive and negative infinity. Come from operations like 1/0. Not the same as missing.
NULL = "nothing at all": an empty object, length 0. Not the same as NA. Used most often to mean "there is no value here, not even a missing one."
You will mostly only see NA in your day-to-day data. The others
are good to recognize when they appear, but you don't need to
work with them constantly.
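If you want to see these special values for yourself, here is a short sketch you can run at the console. It uses only base R functions.

```r
0 / 0                          # NaN
1 / 0                          # Inf
-1 / 0                         # -Inf

is.na(NaN)                     # TRUE: NaN counts as missing
is.nan(NA)                     # FALSE: but NA is not NaN
is.finite(c(1, Inf, NA, NaN))  # TRUE FALSE FALSE FALSE

# NULL has no length at all; NA is one (unknown) value.
length(NULL)                   # 0
length(NA)                     # 1
c(1, NULL, 3)                  # 1 3: NULL simply disappears
c(1, NA, 3)                    # 1 NA 3: NA holds its place
is.null(NULL)                  # TRUE
```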
Test your understanding
What does mean(c(1, 2, 3, NA)) return?
2
NA
An error
0
Which of these is the correct way to test whether x is missing?
x == NA
x = NA
is.na(x)
x !== NA
What does sum(c(10, NA, 20), na.rm = TRUE) return?
NA
30
10
20
Mini challenge: clean and summarize
You have temperature readings for one week. Some days the sensor
failed, leaving NA. Compute:
n_missing: the number of missing readings
avg_temp: the average of the available readings (ignoring NAs)
Given the week vector, set n_missing to the count of NAs and avg_temp to the mean of the non-NA values.
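Try it yourself first. If you get stuck, here is one possible approach. The readings in week below are hypothetical stand-ins, so your numbers will differ if your week vector holds different values.

```r
# Hypothetical readings for seven days; NA marks a sensor failure.
week <- c(21.5, NA, 19.0, 22.5, NA, 20.0, 23.0)

# is.na() gives TRUE/FALSE per day; sum() counts the TRUEs.
n_missing <- sum(is.na(week))
n_missing                      # 2

# na.rm = TRUE drops the NAs before averaging.
avg_temp <- mean(week, na.rm = TRUE)
avg_temp                       # 21.2
```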
That completes our tour of vectors. Next we step up to the two-dimensional star of R: the data frame, where every column is a vector and every row is a record.