Practical R for Beginners

Missing Values (NA)

Real data is full of holes. R has a first-class concept — `NA` — for representing "I don't know," and a small set of rules for working with it correctly.

In real datasets, things go missing. A survey respondent skips a question. A sensor fails for a day. A field on a form is left blank. R has a special value to represent this: NA, which stands for Not Available.

NA is not the same as zero, and it is not an empty string. It is R's way of saying: we don't have a value here.

How `NA` propagates

The single most important rule about NA: most computations involving NA return NA. R is being careful — if you don't know one of the inputs, you don't know the answer either.
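
The rule above can be sketched in a few lines. The vector `x` is a made-up example, and the last line shows one of the rare cases where the answer does not depend on the unknown value, which is why the rule says "most" rather than "all":

```r
# Arithmetic and comparisons with NA yield NA
NA + 1        # NA
NA * 0        # NA
NA > 5        # NA
NA == NA      # NA

# In an element-wise operation, only the unknown position is affected...
x <- c(1, NA, 3)
x * 2         # 2 NA 6

# ...but a summary over the whole vector becomes unknown
sum(x)        # NA

# A rare exception: the result is FALSE whatever the missing value is
NA & FALSE    # FALSE
```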

Notice that even NA == NA returns NA. That trips up almost everyone the first time. Think of it this way: if I don't know either value, I cannot tell you whether they're equal.

This is why R has a dedicated function for "is this value missing?":
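
As a small illustration, here is `is.na()` next to the `==` comparison it replaces. The vector `x` is invented for the example; `which()` and `anyNA()` are base R helpers that pair well with it:

```r
x <- c(4, NA, 7, NA)

x == NA          # NA NA NA NA  (never TRUE, so useless as a test)
is.na(x)         # FALSE TRUE FALSE TRUE

which(is.na(x))  # 2 4   (positions of the missing values)
anyNA(x)         # TRUE  (is anything missing at all?)
```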

is.na() is the way to test for missing values. Never use == NA: the comparison returns NA for every element, with no warning or error to tell you something went wrong.

Summary functions and the `na.rm` argument

Most summary functions (mean, sum, sd, min, max, median) accept an na.rm argument. Setting na.rm = TRUE quietly removes NAs before computing.
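
A minimal sketch of the difference, using a hypothetical `scores` vector with one missing entry:

```r
scores <- c(80, 90, NA, 70)   # made-up values for illustration

mean(scores)                  # NA
mean(scores, na.rm = TRUE)    # 80
sum(scores, na.rm = TRUE)     # 240
max(scores, na.rm = TRUE)     # 90
```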

This is deliberately not the default. R's designers wanted you to consciously choose to ignore missingness, rather than have it silently swept under the rug. It is one of the language's most important small design decisions.

Removing or replacing NAs

Sometimes you want to remove NA values entirely from a vector before doing anything else:
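
One common way to do this is logical indexing with `is.na()`. The sketch below uses an invented vector `x`; base R's `na.omit()` is shown as an alternative:

```r
x <- c(5, NA, 8, NA, 2)

# Keep only the positions where the value is not missing
clean <- x[!is.na(x)]
clean            # 5 8 2

length(x)        # 5
length(clean)    # 3

# na.omit() also drops NAs, and attaches an attribute recording
# which positions were removed
clean2 <- na.omit(x)
```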

Other times, you want to replace NAs with some sensible value (zero, the mean, the last known value):
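
Two replacement patterns can be sketched with base R alone. The vector `x` and the names `x_zero` and `x_mean` are made up for this example. Carrying the last known value forward usually relies on an add-on package and is not shown here.

```r
x <- c(10, NA, 30, NA)

# Pattern 1: replace NAs with a fixed value (here, zero)
x_zero <- x
x_zero[is.na(x_zero)] <- 0
x_zero           # 10 0 30 0

# Pattern 2: replace NAs with the mean of the observed values
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)
x_mean           # 10 20 30 20
```

Both patterns copy `x` first, so the original vector, NAs included, stays available for comparison.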

Be careful with the second pattern (mean imputation). It is a real statistical decision, not just a cleaning step. Imputing the mean understates the variability in your data and can bias downstream analyses. Use it knowingly, not by default.
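
To see the understated variability on a toy example (values invented for illustration), compare the standard deviation of the observed values with the standard deviation after mean imputation:

```r
x <- c(10, NA, 30, NA, 50)

# Spread of the values we actually observed
sd_observed <- sd(x, na.rm = TRUE)
sd_observed      # 20

# Fill the gaps with the mean, then measure the spread again
x_imp <- x
x_imp[is.na(x_imp)] <- mean(x, na.rm = TRUE)
sd_imputed <- sd(x_imp)

sd_imputed < sd_observed   # TRUE: the imputed data looks less variable
```

The imputed values sit exactly at the mean, so they add observations without adding any spread.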

Missingness has meaning

Here is the most important idea on this entire page, and probably the most under-appreciated idea in beginner data work: the fact that a value is missing is itself information.

If 30% of survey respondents skipped the "income" question, that is not noise — that's a signal. Maybe people in certain income brackets are more likely to skip. Maybe the question was poorly worded. Maybe there's a UI bug. Either way, you cannot find out unless you look at the missingness before deciding what to do.

For now, the practical takeaway is: always check for NAs before computing summaries, and always make a conscious choice about how to handle them.
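
A quick way to look at missingness before summarizing is to count it, overall and by group. This is only a sketch: the `income` and `age_group` vectors are hypothetical survey data.

```r
# Hypothetical survey answers; NA means the respondent skipped the question
income    <- c(52000, NA, 61000, NA, NA, 48000, 75000, NA, 58000, 66000)
age_group <- c("young", "young", "old", "old", "old",
               "young", "young", "old", "young", "old")

anyNA(income)          # TRUE
sum(is.na(income))     # 4    (how many are missing)
mean(is.na(income))    # 0.4  (what share is missing)

# Share missing within each group: is skipping related to the group?
miss_by_group <- tapply(is.na(income), age_group, mean)
miss_by_group          # old 0.6, young 0.2
```

`mean()` works on the logical vector because TRUE counts as 1 and FALSE as 0, so the average is the proportion of missing values.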

A close cousin: `NaN`, `Inf`, and `NULL`

R has a few other "special" values that beginners sometimes confuse with NA:

NaN = "Not a Number." Comes from operations like 0/0. It is technically of numeric type. is.na(NaN) returns TRUE.
Inf and -Inf = positive and negative infinity. Come from operations like 1/0 and -1/0. Not the same as missing: is.na(Inf) returns FALSE.
NULL = "nothing at all" — an empty object, length 0. Not the same as NA. Used most often to mean "there is no value here, not even a missing one."
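
The differences between these values are easiest to see side by side. A short sketch using only base R:

```r
0 / 0            # NaN
1 / 0            # Inf
-1 / 0           # -Inf

is.na(NaN)       # TRUE   (NaN counts as missing)
is.nan(NA)       # FALSE  (but NA is not NaN)
is.na(Inf)       # FALSE  (infinity is a known value)
is.finite(c(1, Inf, NA, NaN))   # TRUE FALSE FALSE FALSE

length(NA)       # 1  (NA occupies a slot)
length(NULL)     # 0  (NULL is nothing at all)
c(1, NA, 3)      # 1 NA 3
c(1, NULL, 3)    # 1 3  (NULL vanishes when combined)
is.null(NULL)    # TRUE
```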

You will mostly see only NA in your day-to-day data. The others are good to recognize when they appear, but you won't need to work with them constantly.

Test your understanding

Question (select one)

What does mean(c(1, 2, 3, NA)) return?

2

NA

An error

0

Question (select one)

Which of these is the correct way to test whether x is missing?

x == NA

x = NA

is.na(x)

x !== NA

Question (select one)

What does sum(c(10, NA, 20), na.rm = TRUE) return?

NA

30

10

20

Mini challenge: clean and summarize

You have temperature readings for one week. Some days the sensor failed, leaving NA. Compute:

n_missing: the number of missing readings
avg_temp: the average of the available readings (ignoring NAs)

Handle a week of readings

Given the week vector, set n_missing to the count of NAs and avg_temp to the mean of the non-NA values.
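
One possible solution is sketched below. The readings in `week` are invented, because the original values are not shown here; the approach is the same for any numeric vector.

```r
# Hypothetical week of temperature readings; NA marks a sensor failure
week <- c(21.5, NA, 23.0, 22.5, NA, 24.0, 23.0)

# is.na() gives TRUE/FALSE per day; summing counts the TRUEs
n_missing <- sum(is.na(week))
n_missing        # 2

# na.rm = TRUE drops the NAs before averaging
avg_temp <- mean(week, na.rm = TRUE)
avg_temp         # 22.8
```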

That completes our tour of vectors. Next we step up to the two-dimensional star of R: the data frame, where every column is a vector and every row is a record.
