Statistics for Data Science with Python

Why Statistics Matters

Why raw data is rarely enough, where uncertainty and randomness come from, and why data science is built on statistical reasoning.

You have the data. It's clean, it's in a DataFrame, the chart renders. So why do you need statistics at all — can't you just read off the answer?

Sometimes, yes. If you want to know how many orders shipped yesterday, you count them. Done. But almost every interesting question in data science isn't about the rows you have — it's about a larger reality you can't see directly. "Will this checkout redesign increase conversion?" "Are our two warehouses really performing differently?" "Is churn going up, or did we just have a noisy month?" For those, the data in front of you is a clue, not the answer. This short page is about why.

Why data alone is often not enough

Raw data answers questions about itself. Statistics answers questions about the world the data came from. Those are different things, and the gap between them is where careers are made or wrecked.

A spreadsheet of 500 surveyed customers can tell you exactly how those 500 people answered. But those 500 people aren't the ones you care about; you care about the hundreds of thousands you didn't survey. Crossing from "what these 500 said" to "what our customers think" is not a pandas operation. It's a statistical one, and it can only be done with an honest accounting of uncertainty.

Description vs. inference

Counting, averaging, and charting what you have is description: it is exactly true of the data as recorded. Using that data to make claims about something bigger is inference, and inference is never certain. This course is mostly about doing inference honestly.
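To make the distinction concrete, here is a minimal sketch using a synthetic population. Everything in it is an assumption made for illustration: the population size of 300,000, the "true" preference rate of 60%, and the variable names. In real work the population rate is exactly the thing you cannot see.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 300,000 customers, about 60% of whom prefer a new design.
# This truth is known here only because we invented it.
POPULATION_SIZE = 300_000
TRUE_RATE = 0.60
population = [rng.random() < TRUE_RATE for _ in range(POPULATION_SIZE)]

# Description: an exact fact about the 500 people we happened to survey.
sample = rng.sample(population, 500)
sample_rate = sum(sample) / len(sample)

# The target of inference: the rate across everyone, surveyed or not.
population_rate = sum(population) / len(population)

print(f"sample rate (description):    {sample_rate:.3f}")
print(f"population rate (the target): {population_rate:.3f}")

# Five different surveys of 500 give five different answers.
survey_rates = [sum(rng.sample(population, 500)) / 500 for _ in range(5)]
print("five separate surveys:", [f"{rate:.3f}" for rate in survey_rates])
```

Each survey rate is a perfectly accurate description of its own 500 respondents, yet none of them is guaranteed to equal the population rate. That gap is what inference has to account for.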

Why uncertainty exists in real datasets

Uncertainty isn't a sign that you collected the data badly. It's baked into the situation. It comes from at least three unavoidable sources:

You measured a sample, not everything. You surveyed 500 of 300,000 customers. A different 500 would have given different numbers. That wobble is sampling uncertainty.
Measurement is imperfect. Sensors drift, people round their answers, timestamps get logged in the wrong timezone, a survey question is read two different ways.
The world itself is variable. Real processes have natural spread. Daily sales aren't a constant 100 — they're 95, 110, 88, 130, even if nothing about the business changed.

[Code block (Python 3.13.2): original code not preserved]

Nothing in that business changed, yet the numbers swing by 30+ units. If you cherry-pick the high day and the low day, you can tell any story you like. Statistics is what stops you.
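A minimal sketch of that kind of stable-but-noisy process follows. It assumes daily sales are drawn from a normal distribution with mean 100 and standard deviation 15; both parameters are made up for illustration and are not taken from any real business.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Assumed stable process: the mean and spread never change from day to day.
MEAN_SALES = 100
SD_SALES = 15
daily_sales = [round(rng.gauss(MEAN_SALES, SD_SALES)) for _ in range(30)]

low, high = min(daily_sales), max(daily_sales)

print("first week:", daily_sales[:7])
print(f"mean = {statistics.mean(daily_sales):.1f}, "
      f"stdev = {statistics.stdev(daily_sales):.1f}")
print(f"lowest day = {low}, highest day = {high}, swing = {high - low}")
```

The process generating these numbers is identical every day, so any story about the gap between the lowest and highest day would be a story about noise.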

Why randomness matters

Randomness is the reason a difference between two numbers might mean nothing. This is the single most important habit this course will build: when you see that Group A's average is higher than Group B's, your first thought should not be "A is better" — it should be "could this gap just be luck?"

[Code block (Python 3.13.2): original code not preserved]

Watch the gap column: +2, then −3, then +1... There's no real difference anywhere, yet the gap is never exactly zero. Random variation manufactures fake differences for free. Your job is to tell those apart from real ones — and that requires a model of how big "normal" random wiggles can get. That model is what probability and sampling distributions give you.
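The effect can be reproduced with a sketch like the one below. It assumes both groups are drawn from the same normal distribution (mean 100, standard deviation 15, 50 observations per group); these settings and the helper name `simulate_gap` are hypothetical, so the specific gaps it prints will differ from the values quoted above.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def simulate_gap(n_per_group=50, mean=100, sd=15):
    """Draw two groups from the SAME distribution and return mean(A) - mean(B)."""
    group_a = [rng.gauss(mean, sd) for _ in range(n_per_group)]
    group_b = [rng.gauss(mean, sd) for _ in range(n_per_group)]
    return statistics.fmean(group_a) - statistics.fmean(group_b)

# Repeat the "experiment" ten times. There is no real difference in any of them.
gaps = [simulate_gap() for _ in range(10)]

print("trial   gap (A - B)")
for trial, gap in enumerate(gaps, start=1):
    print(f"{trial:>5}   {gap:+.1f}")
```

Because both groups come from the same distribution, every nonzero gap in the output is produced by random variation alone.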

The most common mistake in data work

Treating any difference as meaningful just because it's nonzero. With real data, two groups always differ a little. The question is never "are they different?" — it's "are they different by more than chance can explain?"
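One common way to ask "more than chance can explain?" is a permutation test: shuffle the group labels many times and see how often a gap at least as large as the observed one appears by luck alone. The sketch below is an illustration under stated assumptions, not necessarily the method used later in the course. The function name `permutation_p_value` and all of the data are invented for the example.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Share of label shuffles whose absolute gap is >= the observed absolute gap."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign group labels at random
        gap = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if gap >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Synthetic data: one comparison with no real difference, one with a real shift.
data_rng = random.Random(1)
group_a = [data_rng.gauss(100, 15) for _ in range(40)]
same_b = [data_rng.gauss(100, 15) for _ in range(40)]     # same process as A
shifted_b = [data_rng.gauss(120, 15) for _ in range(40)]  # truly 20 units higher

p_same = permutation_p_value(group_a, same_b)
p_shifted = permutation_p_value(group_a, shifted_b)

print(f"no real difference: share of shuffles matching the gap = {p_same:.3f}")
print(f"real 20-unit shift: share of shuffles matching the gap = {p_shifted:.3f}")
```

A small share means random relabeling rarely produces a gap that big, so chance alone is a poor explanation. A large share means the observed gap is the kind of thing noise produces routinely.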

Why analysts need statistics: it supports decisions

Data science exists to support decisions: ship or don't ship, investigate or move on, invest more or cut losses. Decisions made on noise are expensive. Statistics is the layer that turns "here's a number" into "here's a number, here's how confident we are, and here's what could still go wrong."

Notice that "describe it" is just one step on the way from data to decision. A huge share of the value, and almost everything that separates a senior data scientist from a dashboard, lives in the "quantify uncertainty" step. That's the step this course is about.
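As a rough sketch of what "a number plus how confident we are" can look like, the example below attaches a standard error and an approximate 95% interval to a survey proportion. It uses the normal approximation for a proportion, which is one simple choice among several, and illustrative figures (310 of 500 respondents, or 62%). The function name is hypothetical.

```python
import math

def proportion_with_uncertainty(successes, n, z=1.96):
    """Return (estimate, standard_error, (low, high)) via the normal approximation."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error  # z = 1.96 corresponds to roughly 95% coverage
    return p_hat, standard_error, (p_hat - margin, p_hat + margin)

# Illustrative survey: 310 of 500 respondents prefer the new design.
p_hat, standard_error, (low, high) = proportion_with_uncertainty(310, 500)

print(f"estimate:                 {p_hat:.1%}")
print(f"standard error:           {standard_error:.1%}")
print(f"approximate 95% interval: {low:.1%} to {high:.1%}")
```

With these inputs the interval runs from roughly 58% to 66%. Reporting that range alongside the 62% is the difference between "here's a number" and "here's a number and how far off it could plausibly be."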

Why data science leans so heavily on statistical reasoning

Machine learning, experimentation, forecasting, and analytics all rest on the same foundation: reasoning from a limited, noisy sample to a general conclusion. The data a model was trained on last year is a sample; asking whether the model will work next year is a statistical question. An A/B test is a textbook hypothesis test. "Is this metric trending or just noisy?" is a question about sampling variability.

The throughline

Nearly every hard question in data science reduces to the same shape: "I observed something in a noisy, partial sample. How much of it is real, and how confident can I be?" Statistical reasoning is the general-purpose tool for that shape — which is why it shows up everywhere.

Check your understanding

Question (select one)

You survey 500 customers and find 62% prefer the new design. What is the honest way to describe this result?

Exactly 62% of all our customers prefer the new design

In this sample, 62% preferred it; the true population preference is probably near 62% but uncertain, and we'd want to quantify that uncertainty

The survey proves the new design is better

62% is meaningless because it's only 500 people

Question (select one)

Daily sales for an unchanged business are 95, 110, 88, 130, 102. A manager points at 88 and 130 and says "something happened mid-week!" What's the best statistical response?

They're right — a 42-unit swing must have a cause

That swing is well within the normal random variation of a stable process, so there's no reason to invent a cause yet

We should remove 88 and 130 as outliers

We can't say anything without more data

Question (select one)

Which of these is a question that description alone can answer, with no statistical inference needed?

Will next month's revenue beat this month's?

Do our premium users churn less than free users in general?

How many orders are in this table that shipped yesterday?

Is the new layout better than the old one?

Carry this forward

For the rest of the course, train one reflex: whenever you see a number, ask "this is a fact about my sample — what is it trying to tell me about the world, and how sure can I be?" Everything else is machinery in service of that question.

