Statistics for Machine Learning — Bench

Why ML is Just Statistics Wearing a Different Jacket

Most of what machine learning does can be restated in probabilistic terms. A classifier is estimating a conditional probability P(y | x). A regression model trained with squared error is estimating a conditional mean E[y | x]. A neural network trained with cross-entropy loss is maximising likelihood. The engineering framing (layers, optimisers, architectures) is important, but the statistical framing is where you find out why things work and when they’ll fail.

The concepts below are not background knowledge. They are the actual machinery underneath most of what gets called “AI.”

Probability Basics

Random variable: a variable whose value is determined by a random process. Discrete (takes countable values) or continuous (takes values in a range).

Probability distribution: describes how likely each value is.

Discrete: probability mass function (PMF) → P(X = x), must sum to 1
Continuous: probability density function (PDF) → integrates to 1; P(a ≤ X ≤ b) = ∫f(x)dx from a to b

Expectation (expected value):

E[X] = Σ x·P(X=x)         (discrete)
E[X] = ∫ x·f(x) dx        (continuous)

The average value X takes over many samples. Linearity: E[aX + b] = aE[X] + b.

Variance:

Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
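
As a quick check of the formulas above, here is a minimal sketch that computes the expectation and variance of a discrete random variable directly from its PMF. The fair six-sided die and the constants a and b are illustrative assumptions, not part of the definitions.

```python
import numpy as np

# Hypothetical example: a fair six-sided die.
values = np.arange(1, 7)          # possible outcomes 1..6
pmf = np.full(6, 1 / 6)           # P(X = x) for each outcome; sums to 1

mean = np.sum(values * pmf)                       # E[X]
var_def = np.sum((values - mean) ** 2 * pmf)      # E[(X - E[X])^2]
var_alt = np.sum(values ** 2 * pmf) - mean ** 2   # E[X^2] - (E[X])^2

# Linearity of expectation: E[aX + b] = a E[X] + b
a, b = 2.0, 3.0
mean_affine = np.sum((a * values + b) * pmf)

print(mean, var_def, var_alt, mean_affine)
```

Both variance expressions agree up to floating-point error, and mean_affine matches a * mean + b.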

Joint, marginal, conditional:

Joint: P(X, Y) — probability of both
Marginal: P(X) = Σ_y P(X, Y=y) — summing out the other variable
Conditional: P(X | Y) = P(X, Y) / P(Y) — probability of X given Y is known

Independence: X and Y are independent if P(X, Y) = P(X)·P(Y). Knowing Y tells you nothing about X.
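
The joint, marginal, and conditional definitions can be sketched on a small table. The 2×2 joint distribution below is made up for illustration; rows index X and columns index Y.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y): rows index X, columns index Y.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)            # marginal P(X): sum out Y
p_y = joint.sum(axis=0)            # marginal P(Y): sum out X
p_x_given_y = joint / p_y          # column j holds P(X | Y = j)

# Independence holds only if the joint equals the outer product of the marginals.
independent = np.allclose(joint, np.outer(p_x, p_y))
print(p_x, p_y, p_x_given_y, independent)
```

For this particular table the joint does not factorise, so X and Y are dependent: knowing Y shifts the distribution of X.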

Bayes’ Theorem

The central result connecting prior belief, evidence, and updated belief:

P(H | D) = P(D | H) · P(H) / P(D)

P(H): prior — your belief before seeing data
P(D | H): likelihood — how probable the data is under hypothesis H
P(D): marginal likelihood / evidence — normalising constant, P(D) = Σ P(D | H_i)·P(H_i)
P(H | D): posterior — updated belief after seeing data
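
A small worked example may help make the four terms concrete. The numbers below describe a hypothetical diagnostic test and are assumptions chosen for illustration only.

```python
# Hypothetical numbers for a diagnostic test.
prior = 0.01            # P(H): 1% of people have the condition
sensitivity = 0.95      # P(D | H): test is positive given the condition
false_positive = 0.05   # P(D | not H): test is positive without the condition

# Evidence P(D) = sum over hypotheses of P(D | H_i) * P(H_i)
evidence = sensitivity * prior + false_positive * (1 - prior)

posterior = sensitivity * prior / evidence   # P(H | D)
print(f"P(H | D) = {posterior:.3f}")
```

With these assumed numbers the posterior is about 0.16: even after a positive result, the low prior keeps the updated belief well below the test's 95% sensitivity.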

In ML this shows up everywhere:

Naïve Bayes classifier: apply Bayes’ theorem with an independence assumption on features
Bayesian neural networks: treat weights as distributions rather than point estimates
Regularisation: the L2 penalty corresponds to a Gaussian prior on weights; the L1 penalty corresponds to a Laplace prior

Bayes is not just a formula. It’s a framework for updating beliefs rationally. The prior encodes what you knew before; the likelihood encodes what the data says; the posterior is what you should believe now.

Key Distributions in ML

Bernoulli: single binary outcome (coin flip). P(X=1) = p, P(X=0) = 1−p. Shows up in binary classifiers — the output is a Bernoulli probability.

Binomial: k successes in n Bernoulli trials.

P(X=k) = C(n,k) · p^k · (1−p)^(n−k)

Categorical / Multinoulli: generalises Bernoulli to K classes. P(X=k) = p_k, Σp_k = 1. The softmax output of a classifier is a categorical distribution.
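
To illustrate the link between softmax and the categorical distribution, here is a minimal sketch. The logits are hypothetical classifier scores, and the max-subtraction step is a common numerical-stability convention rather than part of the definition.

```python
import numpy as np

def softmax(logits):
    """Map real-valued scores to a categorical distribution."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])      # hypothetical classifier scores
probs = softmax(logits)

rng = np.random.default_rng(0)
sample = rng.choice(len(probs), p=probs)  # draw one class index from the distribution
print(probs, probs.sum(), sample)
```

The output probabilities are positive and sum to 1, so they can be used directly as the parameters p_k of a categorical distribution.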

Gaussian (Normal):

f(x) = 1/(σ√(2π)) · exp(−(x−μ)²/(2σ²))

Ubiquitous due to the Central Limit Theorem. Assumptions of Gaussian noise show up implicitly whenever you use mean squared error loss.

Multivariate Gaussian: extends to d dimensions with mean vector μ and covariance matrix Σ.

f(x) = 1/((2π)^(d/2) |Σ|^(1/2)) · exp(−½(x−μ)ᵀΣ⁻¹(x−μ))

Used in Gaussian mixture models, linear discriminant analysis, Gaussian processes.

Exponential family: a general class including Gaussian, Bernoulli, Poisson, Gamma. Many ML models (GLMs, exponential family PCA) are built on this class because it has clean mathematical properties — sufficient statistics, conjugate priors, tractable inference.

Maximum Likelihood Estimation (MLE)

Given independent and identically distributed (i.i.d.) data D = {x₁, …, xₙ} and a model parameterised by θ, find θ that makes the data most probable:

θ_MLE = argmax_θ P(D | θ) = argmax_θ ∏ P(xᵢ | θ)

Because products are numerically unstable and annoying to differentiate, take the log:

θ_MLE = argmax_θ Σ log P(xᵢ | θ)

Maximising log-likelihood is equivalent because log is monotonically increasing.
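
A minimal sketch of MLE in practice, assuming a Bernoulli model and simulated coin flips (the true p = 0.7 and the grid of candidate values are illustrative choices). It compares a brute-force argmax of the log-likelihood with the known closed-form answer, the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=1000)    # simulated coin flips, true p = 0.7 (assumed)

def log_likelihood(p, x):
    """Sum of log P(x_i | p) for a Bernoulli model."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

grid = np.linspace(0.001, 0.999, 999)     # candidate values of p
ll = np.array([log_likelihood(p, data) for p in grid])
p_grid = grid[np.argmax(ll)]              # numerical argmax of the log-likelihood
p_closed = data.mean()                    # closed-form Bernoulli MLE

print(p_grid, p_closed)
```

The grid search is only there to show the argmax definition at work; in real models the same objective is maximised with gradient-based optimisers.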

Why does this matter for ML?

Minimising mean squared error (MSE) for regression is exactly MLE under a Gaussian noise assumption. If you assume yᵢ = f(xᵢ) + ε where ε ~ N(0, σ²), then MLE for the parameters of f gives you MSE.
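
The MSE claim can be checked numerically. The sketch below assumes a one-parameter model f(x) = w·x, simulated data, and a known noise scale sigma, all of which are illustrative choices. Under those assumptions the Gaussian negative log-likelihood equals a constant plus n/(2σ²) times the MSE, so both are minimised by the same w.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + rng.normal(0, 0.5, size=200)   # simulated data: true slope 3, sigma 0.5

sigma = 0.5                                   # noise scale, assumed known here

def mse(w):
    return np.mean((y - w * x) ** 2)

def gaussian_nll(w):
    """Negative log-likelihood of y under y_i ~ N(w * x_i, sigma^2)."""
    resid = y - w * x
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

slopes = np.linspace(0, 6, 601)
w_mse = slopes[np.argmin([mse(w) for w in slopes])]
w_nll = slopes[np.argmin([gaussian_nll(w) for w in slopes])]
print(w_mse, w_nll)   # the two minimisers coincide
```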

Minimising cross-entropy loss for classification is MLE under a categorical distribution assumption. The cross-entropy H(p, q) = −Σ p_k log q_k measures how well your predicted distribution q matches the true distribution p. Minimising it is equivalent to maximising the log-likelihood of the true labels under your model’s predicted distribution.
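
With one-hot labels, the cross-entropy sum collapses to the negative log-probability the model assigns to the true class. A minimal sketch, using made-up predictions for three examples:

```python
import numpy as np

# Hypothetical predicted distributions q for 3 examples over 3 classes.
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])              # true class of each example
p = np.eye(3)[labels]                     # one-hot "true" distributions

cross_entropy = -np.sum(p * np.log(q), axis=1).mean()            # mean of H(p, q)
neg_log_lik = -np.log(q[np.arange(len(labels)), labels]).mean()  # mean of -log q(true label)

print(cross_entropy, neg_log_lik)   # identical up to floating-point error
```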

Many standard loss functions are negative log-likelihoods in disguise. This is the unifying insight: choose a noise model, write the likelihood, take the log, negate it, and you have a loss function.

Maximum A Posteriori (MAP) and Regularisation

MLE finds θ that maximises P(D | θ). MAP finds θ that maximises the posterior P(θ | D) ∝ P(D | θ)·P(θ):

θ_MAP = argmax_θ [log P(D | θ) + log P(θ)]

The extra term log P(θ) is the log prior. If you place a zero-mean Gaussian prior on θ, P(θ) ∝ exp(−λ||θ||²), then log P(θ) = −λ||θ||² + constant. Writing loss for the negative log-likelihood −log P(D | θ), maximising the MAP objective is equivalent to minimising:

loss + λ||θ||²

That’s L2 regularisation (weight decay). The regulariser is the prior in disguise.
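
For linear regression with Gaussian noise, both estimates have closed forms, which makes the effect of the prior easy to see. This is a sketch on simulated data; fit_ridge, true_w, and the value lam = 10 are hypothetical names and choices. Note that lam here multiplies a sum-of-squares loss, so it absorbs the noise variance and is proportional to, not identical to, the λ in the prior.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 0.0])     # hypothetical ground truth
y = X @ true_w + rng.normal(0, 0.5, size=50)

def fit_ridge(X, y, lam):
    """Minimise ||y - Xw||^2 + lam * ||w||^2 (lam = 0 gives ordinary least squares)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_mle = fit_ridge(X, y, lam=0.0)     # MLE under Gaussian noise
w_map = fit_ridge(X, y, lam=10.0)    # MAP under a zero-mean Gaussian prior on w

print(np.linalg.norm(w_mle), np.linalg.norm(w_map))
```

The MAP weights have a smaller norm than the MLE weights: the prior pulls the estimate toward zero, which is exactly what weight decay does.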

L1 regularisation corresponds to a Laplace prior on weights, which is sharply peaked at zero, so the MAP estimate tends to set many weights exactly to zero (sparse solutions).

Regularisation is not just a trick to prevent overfitting. It is a principled encoding of prior beliefs about what good parameters look like.

Bias-Variance Decomposition

For a regression model with MSE loss, the expected error decomposes:

E[(y − ŷ)²] = Bias² + Variance + Irreducible Noise

Bias: how much the model’s average prediction differs from the true value — a systematic error from wrong assumptions
Variance: how much predictions vary across different training sets — sensitivity to the specific sample seen
Irreducible noise: inherent randomness in the data that no model can capture

High bias (underfitting): the model is too simple. It misses patterns in both training and test data. A linear model for a nonlinear relationship.

High variance (overfitting): the model is too flexible. It memorises the training data and fails to generalise. A polynomial of degree 20 fit to 10 data points.
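
The decomposition can be estimated by simulation: draw many training sets, refit the model on each, and measure how the predictions behave across refits. The sketch below assumes a sine-wave ground truth, Gaussian noise, and polynomial models of a few degrees; all of these are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)          # assumed ground-truth function

x_test = np.linspace(0.05, 0.95, 50)       # points where error is measured
n_train, noise_sd, n_trials = 15, 0.3, 300

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit over many training sets."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        preds[t] = Polynomial.fit(x, y, degree, domain=[0, 1])(x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b:.3f}, variance = {v:.3f}")
```

In this setup the straight line should show the largest squared bias, while the degree-9 fit should show the largest variance, matching the underfitting and overfitting descriptions above.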

The tradeoff is not a law of nature; it is a property of a given model class and training procedure. Modern deep learning has complicated the picture: in the double descent phenomenon, test error can fall again as models grow past the point where they fit the training data exactly, so very large neural networks can achieve low bias and surprisingly low variance at the same time, which classical bias-variance theory didn’t predict.

Hypothesis Testing Concepts in ML Context

p-value: the probability of observing results at least as extreme as yours, assuming the null hypothesis is true. It is not the probability that the null is true.

In ML evaluation, classical hypothesis testing shows up in:

Comparing the performance of two models on the same test set (paired t-test, McNemar’s test)
A/B testing at deployment
Ablation studies where you need to establish that a component contributes significantly

The multiple comparisons problem: if you test 100 hyperparameter configurations and pick the best, your “best result” is optimistically biased, because searching a large space rewards lucky noise as well as real skill. This is why final evaluation should happen on a held-out test set that has never influenced the model, including during model selection.
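
A small simulation shows the selection effect. It assumes, purely for illustration, that all 100 configurations have the same true accuracy of 0.5, so any difference between them on the validation set is noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_configs, n_val, n_test = 100, 200, 200
true_accuracy = 0.5      # assumption: every configuration is equally (un)skilled

# Accuracy of each configuration on the validation set and on an untouched test set.
val_acc = rng.binomial(n_val, true_accuracy, size=n_configs) / n_val
test_acc = rng.binomial(n_test, true_accuracy, size=n_configs) / n_test

best = np.argmax(val_acc)                      # model selection uses validation only
print(f"best validation accuracy: {val_acc[best]:.3f}")
print(f"its test accuracy:        {test_acc[best]:.3f}")
```

The winning validation score sits above 0.5 even though no configuration is better than chance. The test score of the selected configuration is an unbiased estimate because the test set played no part in the selection.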

Information Theory in ML

Entropy: measures the uncertainty (or information content) of a distribution:

H(p) = −Σ p(x) log p(x)

A uniform distribution has maximum entropy (most uncertain). A point mass has zero entropy (no uncertainty).

Cross-entropy:

H(p, q) = −Σ p(x) log q(x)

Used as a loss function for classification. p is the true distribution (one-hot labels), q is the model’s predicted distribution. Cross-entropy decomposes as H(p, q) = H(p) + KL(p || q); with one-hot labels H(p) = 0, so the cross-entropy loss equals the KL divergence.

KL divergence:

KL(p || q) = Σ p(x) log (p(x)/q(x))

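Measures how much information is lost when using q to approximate p. Not symmetric: KL(p || q) ≠ KL(q || p) in general. Minimising KL(p || q) over the model q, with p the empirical data distribution, is equivalent to MLE, because the entropy of p does not depend on q.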
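
The three quantities and the identity linking them can be verified numerically. The distributions p and q below are hypothetical, and the sketch assumes q is non-zero wherever p is non-zero.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])          # hypothetical "true" distribution
q = np.array([0.4, 0.4, 0.2])          # hypothetical model distribution

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
print(kl_divergence(p, q), kl_divergence(q, p))   # KL is not symmetric
```

The first line of output satisfies cross-entropy = entropy + KL up to floating-point error; the second shows two different values for the two argument orders.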

In variational autoencoders, the loss has two terms: reconstruction loss (cross-entropy or MSE) and a KL term that regularises the latent space toward a prior distribution.

Covariance, Correlation, and PCA

Covariance matrix Σ for a d-dimensional random variable X:

Σᵢⱼ = Cov(Xᵢ, Xⱼ) = E[(Xᵢ − μᵢ)(Xⱼ − μⱼ)]

Diagonal entries are variances; off-diagonal entries capture linear relationships between dimensions.

Principal Component Analysis (PCA): finds the directions (eigenvectors of Σ) that explain the most variance. Projects data onto a lower-dimensional subspace while preserving as much variance as possible.

Mechanically: centre the data, compute Σ (or use SVD on the centred data matrix), sort eigenvectors by eigenvalue (variance explained), keep the top k.
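
The mechanical steps can be sketched in a few lines of NumPy. The data here is simulated with deliberately unequal variances per axis, and k = 2 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 3-D data with most variance along one direction (illustrative only).
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])

Xc = X - X.mean(axis=0)                       # centre the data first
cov = Xc.T @ Xc / (len(Xc) - 1)               # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # re-sort by variance explained, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
Z = Xc @ eigvecs[:, :k]                       # project onto the top-k components

# Same variances via SVD of the centred data matrix.
singular_values = np.linalg.svd(Xc, compute_uv=False)
svd_vars = singular_values**2 / (len(Xc) - 1)

print(eigvals / eigvals.sum())                # fraction of variance per component
```

The eigenvalue route and the SVD route give the same variances; SVD is usually preferred numerically because it avoids forming the covariance matrix explicitly.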

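PCA also has a likelihood interpretation: the maximum likelihood solution of a latent variable model called probabilistic PCA spans the same principal subspace, and standard PCA is recovered as the model’s noise variance goes to zero. The connection to statistics runs all the way down.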

The Statistical View of Generalisation

A model trained on a finite dataset is trying to approximate a true underlying distribution. Generalisation is about how well the model captures that distribution rather than the specific sample.

Expected risk: E_{(x,y)~P}[loss(f(x), y)] — average loss over the true distribution. What you actually care about.

Empirical risk: (1/n)Σ loss(f(xᵢ), yᵢ) — average loss over your training set. What you can compute.

Empirical risk minimisation (ERM) is the statistical framework for training: choose the f that minimises empirical risk. Validation and test sets provide held-out estimates of expected risk, and therefore of the gap between empirical and expected risk.
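
The gap between the two risks is easy to see in a simulation, where the true distribution is known and fresh samples are free. The sketch below assumes a sine-plus-noise data distribution and a deliberately flexible degree-9 polynomial fitted to 12 points; sample and the specific sizes are hypothetical choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def sample(n):
    """Draw n points from an assumed data distribution P(x, y)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = sample(12)
model = Polynomial.fit(x_train, y_train, 9, domain=[0, 1])   # flexible model, small sample

empirical_risk = np.mean((model(x_train) - y_train) ** 2)   # what you can compute

# Expected risk is approximated with a large fresh sample, possible only in simulation.
x_new, y_new = sample(100_000)
expected_risk_estimate = np.mean((model(x_new) - y_new) ** 2)

print(empirical_risk, expected_risk_estimate)
```

The training loss comes out well below the loss on fresh data. Outside a simulation, a held-out test set plays the role of the fresh sample.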

The learning theory question is: under what conditions does minimising empirical risk lead to good expected risk? VC dimension, Rademacher complexity, and PAC learning bounds formalise this. For practical ML, the heuristic version is simpler: more data reduces the gap; more complex models increase it; generalisation is not guaranteed, only estimated.