Bayes’ Theorem
How to update beliefs with evidence — the formula, the intuition, and why it's the foundation of rational reasoning under uncertainty.
The Formula
Bayes’ theorem connects the probability of a hypothesis given evidence to the probability of the evidence given the hypothesis:
P(H | E) = P(E | H) × P(H) / P(E)
- P(H) — prior: your belief in H before seeing evidence
- P(E | H) — likelihood: probability of seeing this evidence if H is true
- P(E) — marginal probability of evidence (normalising constant)
- P(H | E) — posterior: your updated belief after seeing evidence
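As a sketch, the formula is a one-liner in code. The function and parameter names here are illustrative, not from any library; the numbers plugged in come from the medical-testing example worked through later in this article.

```python
def posterior(prior: float, likelihood: float, marginal: float) -> float:
    """P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / marginal

# Medical test: P(disease) = 0.01, P(+ | disease) = 0.95, P(+) = 0.059
p = posterior(prior=0.01, likelihood=0.95, marginal=0.059)
print(round(p, 3))  # 0.161
```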
Where It Comes From
From the definition of conditional probability:
P(H | E) = P(H ∩ E) / P(E)
P(E | H) = P(H ∩ E) / P(H)
So P(H ∩ E) = P(E | H) × P(H). Substitute into the first:
P(H | E) = P(E | H) × P(H) / P(E)
That’s it. Bayes’ theorem is just the definition of conditional probability, rearranged.
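A quick numeric sanity check of the rearrangement, using an arbitrary made-up joint distribution — both sides of the identity should agree:

```python
# Illustrative joint distribution over H and E (numbers are arbitrary).
p_H_and_E = 0.03   # P(H ∩ E)
p_H = 0.10         # P(H)
p_E = 0.20         # P(E)

lhs = p_H_and_E / p_E             # definition: P(H | E)
likelihood = p_H_and_E / p_H      # definition: P(E | H)
rhs = likelihood * p_H / p_E      # Bayes' theorem, rearranged form

assert abs(lhs - rhs) < 1e-12
print(lhs)  # 0.15
```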
A Worked Example — Medical Testing
A disease affects 1% of the population. A test is 95% sensitive (correctly identifies sick people) and 95% specific (correctly identifies healthy people). You test positive. What’s the probability you actually have the disease?
Most people guess ~95%. The real answer is much lower.
P(disease) = 0.01 (prior — base rate)
P(positive | disease) = 0.95 (sensitivity)
P(positive | healthy) = 0.05 (false positive rate)
P(positive) = P(+|sick)×P(sick) + P(+|healthy)×P(healthy)
= 0.95×0.01 + 0.05×0.99
= 0.0095 + 0.0495 = 0.059
P(disease | positive) = (0.95 × 0.01) / 0.059
= 0.0095 / 0.059
≈ 0.161
You have about a 16% chance of actually having the disease.
Why so low? Because the disease is rare (1%), the small pool of sick people generates few true positives, while the large pool of healthy people generates many false positives. The 95% accuracy sounds impressive, but it operates on very different base population sizes.
This is why doctors order confirmatory tests — a single positive on a rare condition is usually a false alarm.
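The whole calculation fits in a short script. Variable names are ours, chosen for readability; the numbers are the ones from the example above.

```python
prior = 0.01          # P(disease), the base rate
sensitivity = 0.95    # P(+ | disease)
false_pos_rate = 0.05 # P(+ | healthy) = 1 - specificity

# Law of total probability: P(+) over both sick and healthy people
p_positive = sensitivity * prior + false_pos_rate * (1 - prior)

p_disease_given_pos = sensitivity * prior / p_positive
print(round(p_positive, 3))           # 0.059
print(round(p_disease_given_pos, 3))  # 0.161
```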
The Three Parts: Prior, Likelihood, Posterior
Prior: what you believed before seeing the evidence — where you start. Ideally grounded in data (base rates), but sometimes it has to be a subjective estimate.
Likelihood: how well does the hypothesis explain the evidence? P(E|H) measures this. A high likelihood means the evidence was very probable if H is true.
Posterior: what you believe after. The prior updated by the evidence. This becomes the prior for the next piece of evidence.
Bayes is iterative. Every posterior can be used as a new prior when the next piece of evidence arrives. Belief updating is sequential.
The Bayes Factor
A useful reformulation using odds:
Posterior odds = Prior odds × Bayes Factor
Where:
Bayes Factor = P(E | H) / P(E | not H)
The Bayes factor measures how much more likely the evidence is under H than under not-H. A factor > 1 supports H; < 1 weakens it.
BF = 0.95 / 0.05 = 19 (for the medical test above)
Even with a Bayes factor of 19 (strong evidence), the posterior probability is only 16% because the prior was so low (1%). Strong evidence can still yield low confidence if the prior is weak enough.
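The odds form can be checked against the probability-form answer from the medical example. Again, a sketch with illustrative variable names:

```python
prior = 0.01
bf = 0.95 / 0.05                          # Bayes factor = 19

prior_odds = prior / (1 - prior)          # 1 : 99
post_odds = prior_odds * bf               # 19 : 99
post_prob = post_odds / (1 + post_odds)   # odds back to probability

print(round(bf, 1), round(post_prob, 3))  # 19.0 0.161
```

Same 16% as before — the odds form is just a reshuffling of the same arithmetic, but it makes the role of the prior explicit: the Bayes factor multiplies whatever odds you started with.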
Intuition — The Base Rate Problem
The medical example reveals the base rate fallacy — ignoring how common or rare something is when evaluating evidence.
Think in natural frequencies, not percentages. In 10,000 people:
- 100 have the disease → 95 test positive (true positives)
- 9,900 are healthy → 495 test positive (false positives)
- Total positives: 590
- P(disease | positive) = 95/590 ≈ 16%
This framing makes the answer obvious. The denominator is dominated by false positives because there are so many healthy people.
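The same frequency count, scripted — the numbers mirror the list above:

```python
population = 10_000
sick = int(population * 0.01)     # 100 have the disease
healthy = population - sick       # 9,900 are healthy

true_pos = int(sick * 0.95)       # 95 sick people test positive
false_pos = int(healthy * 0.05)   # 495 healthy people test positive

total_pos = true_pos + false_pos  # 590 positives in all
print(round(true_pos / total_pos, 3))  # 0.161
```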
Bayesian vs Frequentist Interpretation
Two schools of thought about what probability means:
Frequentist: probability is the long-run frequency of an event over many repetitions. You can’t assign a probability to a one-off event (“what’s the probability it rains tomorrow?”). Parameters are fixed; data is random.
Bayesian: probability is a degree of belief — it can apply to any uncertain proposition. You start with a prior, update with evidence, get a posterior. Parameters are uncertain; you put distributions on them.
Both are internally consistent. Bayesians can handle one-off events and incorporate prior knowledge; frequentists avoid the subjectivity of choosing priors. In practice, most applied statistics uses a mix.
Bayesian Updating in Practice
Suppose you’re trying to figure out if a coin is fair or biased (P(heads) = 0.5 or P(heads) = 0.7).
Start with equal priors: P(fair) = P(biased) = 0.5.
You flip and get heads. Update:
P(fair | H) ∝ P(H | fair) × P(fair) = 0.5 × 0.5 = 0.25
P(biased | H) ∝ P(H | biased) × P(biased) = 0.7 × 0.5 = 0.35
Normalise: P(fair | H) = 0.25/0.60 ≈ 0.42, P(biased | H) ≈ 0.58.
Flip again, get heads. Repeat. After 10 heads: the posterior strongly favours biased. After 10 alternating flips (5 heads, 5 tails): fair wins back its ground, because each head–tail pair has likelihood 0.5 × 0.5 = 0.25 under fair but only 0.7 × 0.3 = 0.21 under biased. Evidence accumulates.
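The sequential update can be sketched as a short loop. The `update` function is illustrative, assuming the two hypotheses and equal priors described above:

```python
def update(p_fair: float, flip: str) -> float:
    """Posterior P(fair) after observing one flip ('H' or 'T')."""
    like_fair = 0.5                            # fair: heads and tails equally likely
    like_biased = 0.7 if flip == "H" else 0.3  # biased: P(heads) = 0.7
    num = like_fair * p_fair
    return num / (num + like_biased * (1 - p_fair))

p = 0.5                       # equal prior on fair
p = update(p, "H")            # one head, as in the text
print(round(p, 2))            # 0.42

for flip in "HHHHHHHHH":      # nine more heads: ten in total
    p = update(p, flip)
print(round(p, 3))            # 0.033 — the posterior strongly favours biased
```

Note that each call feeds the previous posterior back in as the new prior — exactly the sequential updating described above.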
Where Bayes Shows Up
Spam filters: P(spam | words in email) — trained on examples of spam and legitimate mail.
Medical diagnosis: P(disease | symptoms, test results, history) — sequentially updated.
Search and rescue: Bayesian search updates the probability map of where a lost object is as each sector comes up empty.
Machine learning: Naïve Bayes classifier, Bayesian neural networks, probabilistic graphical models.
Science: model comparison, parameter estimation, hypothesis testing without p-values.
Everyday reasoning: if you hear hoofbeats, think horses not zebras — the prior on horses is much higher. New evidence (the animal is striped) updates you toward zebra. That’s Bayesian reasoning.
The Core Lesson
The posterior is always a compromise between your prior and the evidence. Strong evidence with a weak prior: moderate confidence. Weak evidence with a strong prior: barely moves. Strong evidence on a strong prior: very high confidence.
This is why extraordinary claims require extraordinary evidence — not as a rhetorical move, but mathematically. A very low prior requires very high likelihood to bring the posterior up to meaningful probability. The formula enforces epistemic humility.