Probability Basics — Events, Independence, and Conditional Probability
The rules of probability — how to assign and combine probabilities, what independence means, and how conditioning changes everything.
What Probability Is
Probability assigns a number between 0 and 1 to an event, measuring how likely it is to occur.
- P = 0: impossible
- P = 1: certain
- P = 0.5: equally likely to occur or not
Sample space (S): the set of all possible outcomes. Event (A): a subset of the sample space — outcomes where A occurs.
Roll a die: S = {1, 2, 3, 4, 5, 6}
Event A = "roll an even number" = {2, 4, 6}
P(A) = 3/6 = 1/2
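For equally likely outcomes, P(A) is just |A| / |S|. A minimal Python sketch of the die example (the names are illustrative):

```python
from fractions import Fraction

# Sample space for one die roll, and the event "roll an even number"
sample_space = {1, 2, 3, 4, 5, 6}
event_even = {x for x in sample_space if x % 2 == 0}

# Equally likely outcomes: P(A) = |A| / |S|
p_even = Fraction(len(event_even), len(sample_space))
print(p_even)  # 1/2
```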
The Axioms of Probability
All of probability theory follows from three axioms (Kolmogorov, 1933):
- P(A) ≥ 0 for any event A
- P(S) = 1 (something must happen)
- If A and B are mutually exclusive: P(A ∪ B) = P(A) + P(B)
Everything else — conditional probability, independence, Bayes — is derived from these.
Basic Rules
Complement rule:
P(not A) = P(Aᶜ) = 1 − P(A)
Often easier to calculate the complement and subtract from 1.
Addition rule (general):
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Addition rule (mutually exclusive): if A and B can’t both happen:
P(A ∪ B) = P(A) + P(B)
Multiplication rule (general):
P(A ∩ B) = P(A) × P(B | A)
Multiplication rule (independent): if A and B don’t affect each other:
P(A ∩ B) = P(A) × P(B)
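A short Python check of the general addition and multiplication rules, using two events on one die (A = even, B = at most 3; the helper `p` is just for illustration):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even
B = {1, 2, 3}   # at most 3

def p(event):
    # Equally likely outcomes: P(E) = |E| / |S|
    return Fraction(len(event), len(S))

# Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 1/2 + 1/2 − 1/6
assert p(A | B) == p(A) + p(B) - p(A & B)      # 5/6

# Multiplication rule: P(A ∩ B) = P(A) × P(B | A)
p_B_given_A = Fraction(len(A & B), len(A))     # 1/3
assert p(A & B) == p(A) * p_B_given_A          # 1/6
print(p(A | B), p(A & B))
```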
Mutually Exclusive vs Independent
These are different — and frequently confused.
Mutually exclusive: A and B cannot both occur. P(A ∩ B) = 0.
- Rolling a 3 and rolling a 5 on one die — can’t happen together
Independent: knowing A occurred doesn’t change the probability of B.
- Rolling a 3 on the first die doesn’t affect the second die
Mutually exclusive events with nonzero probability are never independent — if A occurs, you know B definitely didn’t, so B’s probability went from P(B) to 0. Knowing A gave you information about B.
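You can see the dependence directly with one die (a quick sketch; `p` is again just a helper):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {3}   # roll a 3
B = {5}   # roll a 5

def p(event):
    return Fraction(len(event), len(S))

print(p(A & B))      # 0: mutually exclusive
print(p(A) * p(B))   # 1/36: not equal, so A and B are NOT independent
```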
Conditional Probability
P(B | A) is the probability of B given that A has occurred. We update the probability based on new information.
P(B | A) = P(A ∩ B) / P(A)
Roll a die. Given it's even, what's P(it's a 4)?
P(even) = 3/6 = 1/2
P(4 and even) = P(4) = 1/6
P(4 | even) = (1/6) / (1/2) = 1/3
This makes sense — among the three even outcomes {2, 4, 6}, one is a 4.
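The same computation in Python, straight from the definition:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
four = {4}

def p(event):
    return Fraction(len(event), len(S))

# P(B | A) = P(A ∩ B) / P(A)
print(p(four & even) / p(even))  # 1/3
```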
The Law of Total Probability
If B₁, B₂, …, Bₙ partition the sample space (exhaustive and mutually exclusive):
P(A) = Σ P(A | Bᵢ) × P(Bᵢ)
You can calculate P(A) by conditioning on each possible “cause”:
A disease affects 1% of the population.
The test is 95% accurate for the sick (it catches 95% of cases) and 90% accurate for the healthy, so 10% of healthy people test positive.
What's P(test positive)?
P(+) = P(+|sick)×P(sick) + P(+|healthy)×P(healthy)
= 0.95×0.01 + 0.10×0.99
= 0.0095 + 0.099
= 0.1085
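The same arithmetic as a Python sketch (variable names are illustrative):

```python
# Law of total probability over the partition {sick, healthy}
p_sick = 0.01
p_pos_given_sick = 0.95      # test catches 95% of the sick
p_pos_given_healthy = 0.10   # 90% accurate for healthy → 10% false positives

p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
print(p_pos)  # 0.1085
```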
Independence — Formal Definition
A and B are independent if:
P(A ∩ B) = P(A) × P(B)
Equivalently: P(B | A) = P(B) — knowing A gives no information about B.
Testing independence:
P(A) = 0.4, P(B) = 0.3, P(A ∩ B) = 0.12
0.4 × 0.3 = 0.12 ✓ → independent
Multiple independent events: for n mutually independent events (pairwise independence alone isn't enough):
P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) × P(A₂) × ... × P(Aₙ)
The birthday problem: with 23 people in a room, P(at least two share a birthday) > 50%. Counterintuitive because you're counting all C(23, 2) = 253 possible pairs, not comparing everyone to one specific person.
P(all different) = 365/365 × 364/365 × 363/365 × ... × 343/365
≈ 0.493
P(at least one match) ≈ 0.507
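A few lines of Python reproduce the number exactly (ignoring leap years, as the formula above does):

```python
def p_all_different(n, days=365):
    # Multiply the chain 365/365 × 364/365 × ... × (365 − n + 1)/365
    prob = 1.0
    for k in range(n):
        prob *= (days - k) / days
    return prob

print(1 - p_all_different(23))  # ≈ 0.5073
```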
Conditional Independence
A and B are conditionally independent given C if:
P(A ∩ B | C) = P(A | C) × P(B | C)
Two events can be independent unconditionally but dependent given C, or vice versa. This distinction matters enormously in Bayesian networks and causal reasoning.
Simpson’s Paradox: a treatment can appear better in every subgroup but worse overall — because the subgroup sizes differ. Conditioning on the right variable flips the conclusion. This is why controlling for confounders in statistics is non-negotiable.
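A small numeric sketch of the paradox, using classic kidney-stone-style counts (the numbers are illustrative, chosen so the flip occurs):

```python
# (successes, trials) per treatment, split by case severity
data = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

# A wins within every subgroup...
for group, treatments in data.items():
    print(group, {t: round(s / n, 3) for t, (s, n) in treatments.items()})

# ...but B wins after aggregating, because B got far more mild cases
for t in ("A", "B"):
    s = sum(data[g][t][0] for g in data)
    n = sum(data[g][t][1] for g in data)
    print(t, "overall:", round(s / n, 3))
```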
Probability Trees
A tree diagram maps out all possible outcomes with their probabilities at each branch. Multiply along branches (AND), add across branches (OR).
Bag A: 3 red, 2 blue. Bag B: 1 red, 4 blue.
Pick a bag randomly, then a ball.
P(red) = P(red|A)×P(A) + P(red|B)×P(B)
= (3/5)(1/2) + (1/5)(1/2)
= 3/10 + 1/10 = 4/10 = 0.4
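In code, the tree is just the law of total probability again (dict names are illustrative):

```python
from fractions import Fraction

p_bag = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
p_red_given_bag = {"A": Fraction(3, 5), "B": Fraction(1, 5)}

# Multiply along each branch, then add across branches
p_red = sum(p_red_given_bag[b] * p_bag[b] for b in p_bag)
print(p_red)  # 2/5
```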
Expected Value
The expected value E[X] is the probability-weighted average outcome:
E[X] = Σ xᵢ × P(X = xᵢ)
Roll a fair die:
E[X] = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 3.5
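Directly from the definition, in Python:

```python
from fractions import Fraction

# E[X] = Σ xᵢ × P(X = xᵢ) for a fair die
p = Fraction(1, 6)
expected = sum(x * p for x in range(1, 7))
print(expected)  # 7/2 = 3.5
```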
Expected value is not necessarily a possible outcome: you can never roll a 3.5. It's the long-run average over many repetitions, not a prediction for a single trial.
Linearity of expectation: E[X + Y] = E[X] + E[Y], always — even when X and Y are not independent. This is one of the most useful properties in probability.
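A quick sketch showing linearity surviving strong dependence: let Y = 7 − X, which is completely determined by X, and check that E[X + Y] = E[X] + E[Y]:

```python
from fractions import Fraction

p = Fraction(1, 6)
rolls = range(1, 7)

e_x = sum(x * p for x in rolls)                 # 7/2
e_y = sum((7 - x) * p for x in rolls)           # 7/2
e_sum = sum((x + (7 - x)) * p for x in rolls)   # exactly 7

print(e_sum == e_x + e_y)  # True, despite X and Y being totally dependent
```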
The Gambler’s Fallacy
Past independent events don’t affect future ones. If a fair coin lands heads 10 times in a row, the next flip is still 50/50. The coin has no memory.
The fallacy is persistent: casinos display roulette history boards precisely because players believe past spins matter. They don't; the probability of an independent event never changes based on history.
What is true: over a very large number of flips, the proportion of heads converges to 0.5 (Law of Large Numbers). But "due" doesn't exist; the proportion converges because new flips swamp any early streak, not because the streak gets corrected.
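A simulation sketch of the Law of Large Numbers (the seed and sample sizes are arbitrary choices):

```python
import random

random.seed(42)
for n in (10, 1_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    # Each flip stays 50/50; only the long-run proportion settles near 0.5
    print(n, heads / n)
```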