Probability Distributions
The key probability distributions — discrete and continuous — what they model, and when to reach for each.
What a Distribution Is
A probability distribution describes all possible values a random variable can take and how likely each is.
Discrete distributions: the variable takes countably many specific values (typically integers). Described by a probability mass function (PMF): P(X = x).
Continuous distributions: the variable can take any value in a range. Described by a probability density function (PDF): f(x), where probability is area under the curve, not the height.
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
For any continuous distribution, P(X = exactly x) = 0 — you can only ask about ranges.
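A minimal sketch of this distinction, assuming scipy is available: a PMF is evaluated at points, while a PDF only yields probability when integrated over a range.

```python
from scipy import stats
from scipy.integrate import quad

# Discrete: P(X = 3) for Binomial(10, 0.5) comes straight from the PMF.
print(stats.binom.pmf(3, n=10, p=0.5))   # ≈ 0.117

# Continuous: P(X = 0.5) exactly is 0; only ranges carry probability.
pdf = stats.norm(loc=0, scale=1).pdf
prob, _ = quad(pdf, -1, 1)               # area under the curve on [-1, 1]
print(prob)                              # ≈ 0.683
```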
Key Properties
Expected value (mean):
Discrete: E[X] = Σ x · P(X = x)
Continuous: E[X] = ∫ x · f(x) dx
Variance — average squared deviation from the mean:
Var(X) = E[(X − μ)²] = E[X²] − (E[X])²
Standard deviation: σ = √Var(X) — same units as X.
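A small sketch of these formulas for a discrete variable, using a fair six-sided die as an illustrative assumption:

```python
import numpy as np

values = np.arange(1, 7)       # possible outcomes of the die
probs = np.full(6, 1 / 6)      # P(X = x) for each outcome

mean = np.sum(values * probs)                 # E[X] = Σ x · P(X = x)
var = np.sum(values**2 * probs) - mean**2     # E[X²] − (E[X])²
std = np.sqrt(var)                            # same units as X

print(mean, var, std)          # 3.5, ≈ 2.917, ≈ 1.708
```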
Discrete Distributions
Bernoulli
One trial, two outcomes — success (p) or failure (1−p).
P(X = 1) = p
P(X = 0) = 1 − p
E[X] = p, Var(X) = p(1−p)
The building block: the binomial and geometric are constructed directly from Bernoulli trials, and the Poisson arises as their limit.
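A quick illustration of that claim, assuming numpy is available (n = 20, p = 0.5 are arbitrary choices): summing Bernoulli trials yields a binomial draw.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 0.5
flips = rng.random(n) < p      # n Bernoulli(p) trials as booleans
print(flips.sum())             # one Binomial(20, 0.5) sample
```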
Binomial — B(n, p)
n independent Bernoulli trials, count the successes.
P(X = k) = C(n,k) × pᵏ × (1−p)ⁿ⁻ᵏ
E[X] = np
Var(X) = np(1−p)
When to use: fixed number of independent trials, each with the same success probability. Number of heads in 20 flips, defective items in a batch, correct answers guessed on a test.
10 coin flips, p = 0.5. P(exactly 6 heads)?
P(X=6) = C(10,6) × 0.5⁶ × 0.5⁴ = 210 × (1/1024) ≈ 0.205
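Checking the worked example in code, assuming scipy is available (the direct formula needs only the standard library):

```python
from math import comb
from scipy import stats

# Direct formula: C(10,6) · 0.5⁶ · 0.5⁴
print(comb(10, 6) * 0.5**6 * 0.5**4)     # ≈ 0.2051
# Same value via scipy's binomial PMF
print(stats.binom.pmf(6, n=10, p=0.5))   # ≈ 0.2051
```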
Geometric
Number of Bernoulli trials until the first success.
P(X = k) = (1−p)^(k−1) × p
E[X] = 1/p, Var(X) = (1−p)/p²
When to use: waiting for the first success. Number of calls until a sale, number of attempts until a password guess succeeds.
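A simulation sketch, assuming numpy is available (p = 0.2 is an arbitrary choice): the sample mean of geometric waiting times should approach 1/p.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2
samples = rng.geometric(p, size=100_000)  # trials until first success
print(samples.mean())                     # ≈ 5.0 = 1/p
```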
Poisson — Pois(λ)
Number of events in a fixed interval when events occur independently at a constant average rate λ.
P(X = k) = (λᵏ × e^(−λ)) / k!
E[X] = λ
Var(X) = λ (mean equals variance, a distinctive signature)
When to use: counts of rare, independent events in time or space. Emails per hour, accidents per week, mutations per genome, customers per minute at a queue.
Average 3 calls per hour. P(exactly 5 calls in one hour)?
P(X=5) = (3⁵ × e⁻³) / 5! = (243 × 0.0498) / 120 ≈ 0.101
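Verifying the worked example, assuming scipy is available:

```python
from scipy import stats

print(stats.poisson.pmf(5, mu=3))   # ≈ 0.1008
```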
Poisson as a limit of Binomial: when n is large and p is small (rare events), Binomial(n,p) ≈ Poisson(np). Useful approximation.
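A sketch of the approximation, assuming scipy is available; n = 10,000 and p = 0.0003 are illustrative choices giving λ = np = 3.

```python
from scipy import stats

n, p = 10_000, 0.0003
for k in range(6):
    b = stats.binom.pmf(k, n, p)        # exact binomial
    q = stats.poisson.pmf(k, n * p)     # Poisson approximation
    print(k, round(b, 5), round(q, 5))  # columns agree to ~4 decimals
```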
Continuous Distributions
Uniform — U(a, b)
Every value in [a, b] equally likely.
f(x) = 1/(b−a) for a ≤ x ≤ b
E[X] = (a+b)/2
Var(X) = (b−a)²/12
When to use: genuinely no preference among values in a range. Random number generators, rounding errors.
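A quick check of the moment formulas by simulation, assuming numpy is available (a = 0, b = 10 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = 0.0, 10.0
x = rng.uniform(a, b, size=1_000_000)
print(x.mean())   # ≈ (a + b) / 2 = 5.0
print(x.var())    # ≈ (b − a)² / 12 ≈ 8.33
```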
Exponential — Exp(λ)
Time between events in a Poisson process. The continuous analogue of the geometric distribution.
f(x) = λe^(−λx) for x ≥ 0
E[X] = 1/λ
Var(X) = 1/λ²
Memoryless property: P(X > s + t | X > s) = P(X > t). The distribution has no memory — it doesn’t matter how long you’ve been waiting, the remaining wait has the same distribution. The exponential is the only continuous distribution with this property.
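A simulation sketch of memorylessness, assuming numpy is available (λ = 1 and s = 2 are arbitrary choices): among waits that exceed s, the remaining wait behaves like a fresh Exp(λ) draw.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, s = 1.0, 2.0
x = rng.exponential(scale=1 / lam, size=1_000_000)

remaining = x[x > s] - s      # leftover wait, conditioned on X > s
print(remaining.mean())       # ≈ 1.0 = E[X], as if the clock restarted
print(x.mean())               # ≈ 1.0
```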
When to use: time until next event (next failure, next arrival, next earthquake). Service times, radioactive decay, component lifetimes.
Normal (Gaussian) — N(μ, σ²)
The bell curve. Parameterised by mean μ and standard deviation σ.
f(x) = (1/(σ√(2π))) × exp(−(x−μ)²/(2σ²))
E[X] = μ
Var(X) = σ²
Standard normal N(0,1): mean 0, standard deviation 1. Every normal can be converted to it:
Z = (X − μ) / σ
The 68-95-99.7 rule:
μ ± 1σ: 68.3% of the distribution
μ ± 2σ: 95.4%
μ ± 3σ: 99.7%
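The rule can be recovered from the standard normal CDF; a sketch assuming scipy is available:

```python
from scipy import stats

for k in (1, 2, 3):
    prob = stats.norm.cdf(k) - stats.norm.cdf(-k)   # P(−k ≤ Z ≤ k)
    print(f"μ ± {k}σ: {prob:.1%}")                  # 68.3%, 95.4%, 99.7%
```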
When to use: sums and averages of many independent variables (Central Limit Theorem). Measurement errors, heights, IQ scores, financial returns (approximately). The normal is the right model when many small independent effects add up.
Log-Normal
If ln(X) is normally distributed, X is log-normal. Skewed right, values strictly positive.
When to use: quantities that grow multiplicatively — incomes, city populations, stock prices, biological quantities. When you’re taking logs and the result looks normal, the original is log-normal.
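A sketch of the defining property, assuming numpy is available; the parameters of the underlying normal (mean 0, σ = 0.5) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

logs = np.log(x)
print(logs.mean(), logs.std())   # ≈ 0.0 and ≈ 0.5: back to the normal
print(x.min() > 0)               # True: values are strictly positive
```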
Power Law
P(X > x) ∝ x^(−α). Heavy tail — extreme events are far more likely than the normal distribution predicts.
When to use: wealth distribution, city sizes, earthquake magnitudes, internet traffic, word frequencies (Zipf’s law), social network degrees. Any domain with “winner takes most” dynamics.
The signature: on a log-log plot, a power law appears as a straight line.
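A sketch of that signature by simulation, assuming numpy is available (α = 2 is an arbitrary choice): the empirical survival function of Pareto samples is nearly linear in log-log coordinates, with slope −α.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 2.0
x = rng.pareto(alpha, size=200_000) + 1   # classical Pareto, x_min = 1

xs = np.logspace(0, 1.5, 15)                          # thresholds 1 .. ~31.6
survival = [(x > t).mean() for t in xs]               # empirical P(X > t)
slope = np.polyfit(np.log(xs), np.log(survival), 1)[0]
print(slope)                                          # ≈ −2 = −α
```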
Choosing a Distribution
| Situation | Distribution |
|---|---|
| Count of successes in n trials | Binomial |
| Count of rare events in an interval | Poisson |
| Trials until first success | Geometric |
| Time between Poisson events | Exponential |
| Sum of many independent effects | Normal |
| Multiplicative growth, positive skew | Log-Normal |
| Winner-takes-most, heavy tail | Power Law |
| No prior preference over a range | Uniform |
The question to ask: what generates this quantity? Many small additive effects → Normal. Independent rare events → Poisson. Time between events → Exponential. Multiplicative processes → Log-Normal.
The CDF — Cumulative Distribution Function
The CDF F(x) = P(X ≤ x) gives the probability of being at or below x.
P(a ≤ X ≤ b) = F(b) − F(a)
For the normal distribution, probabilities are read from standard normal tables (or computed with erf). Most statistical software handles this directly. The key values to know:
P(Z ≤ 1.645) ≈ 0.95 (one-tailed 95%)
P(Z ≤ 1.96) ≈ 0.975 (two-tailed 95%, each tail 2.5%)
P(Z ≤ 2.576) ≈ 0.995 (two-tailed 99%)
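In practice these come from the CDF and its inverse; a sketch assuming scipy is available, with N(100, 15²) (an IQ-style scale) as an illustrative choice:

```python
from scipy import stats

# P(a ≤ X ≤ b) = F(b) − F(a), e.g. X ~ N(100, 15²) between 85 and 115:
print(stats.norm.cdf(115, loc=100, scale=15)
      - stats.norm.cdf(85, loc=100, scale=15))   # ≈ 0.683

# The key z-values, and their inverse (percent point function):
print(stats.norm.cdf(1.96))    # ≈ 0.975
print(stats.norm.ppf(0.975))   # ≈ 1.96
```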