Scaling Laws and Emergent Capabilities
Kaplan et al. found that language model loss decreases as a power law with compute, data, and parameters. Chinchilla revised the optimal ratio. Neither paper predicted emergence — the discontinuous capability jumps that appear at scale.
The Unexpected Regularity
In 2020, a team at OpenAI led by Jared Kaplan published a paper that changed how AI research was conducted. They trained language models across a wide range of scales (768 to 1.5 billion non-embedding parameters, 22 million to 23 billion training tokens, with compute trends spanning more than seven orders of magnitude) and measured how test loss changed with each variable.
The result: test loss decreases as a smooth power law with each of compute, parameters, and data, so long as the other variables are not a bottleneck. Double the compute while scaling model size appropriately, and the loss decreases by a predictable factor. The relationship holds across the full range studied with no visible deviation; the loss curves are nearly straight lines on a log-log plot.
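As a concrete illustration of what fitting a scaling law means in practice, here is a minimal sketch on synthetic data. The exponent (0.05), the constants, and the noise level are assumptions chosen for the example, not values from Kaplan et al.; the point is that a straight-line fit in log-log space recovers the power-law exponent.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic (compute, loss) pairs following an assumed power law with noise.
# The exponent 0.05 and the constant 1e3 are illustrative, not from the paper.
rng = np.random.default_rng(0)
compute = np.logspace(15, 23, 30)                        # FLOPs
loss = 1e3 * compute ** (-0.05) * rng.lognormal(0.0, 0.01, 30)

# In log space the power law L = a * C^(-alpha) becomes a straight line:
# log L = log a - alpha * log C.
def log_power_law(log_c, log_a, alpha):
    return log_a - alpha * log_c

(log_a, alpha), _ = curve_fit(log_power_law, np.log(compute), np.log(loss))
print(f"fitted exponent: {alpha:.3f}")                   # ~0.05 by construction
```

Fitting in log space rather than on raw losses keeps every order of magnitude weighted equally, which is why scaling-law plots are drawn on log-log axes in the first place.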
This was surprising for several reasons. Neural network training is a highly nonlinear optimization process in a high-dimensional non-convex landscape. There is no obvious reason why the outcome of this process should depend so smoothly and predictably on scale. Yet it does. The regularity is empirical — it’s observed, not derived — and it appears to be quite robust.
The practical implication was immediately recognized. If you can predict how much compute a model will need to achieve a target loss level, you can plan AI development as an engineering discipline rather than an empirical one. The scaling laws turned AI model development from trial and error into something closer to extrapolated engineering.
What the Power Laws Are Saying
A power-law relationship L ∝ N^(−α), with α > 0, means that every fixed multiplicative increase in model size N produces a fixed additive decrease in log-loss. There are no sudden improvements when crossing a threshold, and no diminishing returns relative to the power-law trend at the scales studied. The curve is smooth.
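As a worked example, take the commonly quoted Kaplan et al. model-size exponent, roughly α ≈ 0.076 (used here purely as an illustration). Under L ∝ N^(−α), the loss reduction depends only on the scale-up factor, not on where you start:

```python
# Loss reduction from scaling model size k-fold under L ∝ N^(-alpha).
alpha = 0.076                 # commonly quoted Kaplan et al. exponent (illustrative)
for k in (2, 10, 100):
    factor = k ** (-alpha)    # multiplicative change in loss
    print(f"{k:>4}x parameters -> loss x {factor:.3f} "
          f"({(1 - factor) * 100:.1f}% lower)")
```

Doubling the model buys roughly a 5% loss reduction every time, at any scale. That scale-invariance is the signature of a power law.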
Kaplan et al.'s key finding on the compute-optimal frontier: given a fixed compute budget, you should spend most of any increase on model size, scaling dataset size much more slowly. Larger models are more sample-efficient; they reach a given loss after seeing fewer tokens, so compute is better spent on parameters than on additional data.
The 2022 Chinchilla paper from DeepMind (Hoffmann et al.) challenged this conclusion. They ran a much more thorough set of experiments across a wider range of compute budgets and found a different optimal ratio: for compute-optimal training, model size and training tokens should scale equally. The Chinchilla-optimal recipe uses a smaller model trained on more data than the Kaplan et al. prescription would suggest.
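To see how differently the two prescriptions allocate the same budget, here is a hedged sketch. The exponents are the commonly quoted approximations (model size scaling roughly as C^0.73 under Kaplan et al. versus C^0.5 under Chinchilla), and the anchor point where the two curves coincide is a hypothetical choice made only so they can be compared on the same budgets:

```python
# Contrasting the two prescriptions for splitting a compute budget.
# Exponents are commonly quoted approximations; the anchor (C0, N0) is a
# hypothetical assumption, not a value from either paper.
C0, N0 = 1e21, 1e9            # assumed point where both recipes agree

def optimal_params(compute, exponent):
    """Compute-optimal model size N(C) = N0 * (C/C0)^exponent."""
    return N0 * (compute / C0) ** exponent

for C in (1e22, 1e24, 1e26):
    n_kap, n_chi = optimal_params(C, 0.73), optimal_params(C, 0.50)
    # tokens implied by the budget via the standard C ≈ 6*N*D approximation
    d_kap, d_chi = C / (6 * n_kap), C / (6 * n_chi)
    print(f"C={C:.0e}: Kaplan N={n_kap:.1e}, D={d_kap:.1e} | "
          f"Chinchilla N={n_chi:.1e}, D={d_chi:.1e}")
```

By 10²⁶ FLOPs the two recipes differ by more than an order of magnitude in model size, which is the gap the Gopher-versus-Chinchilla comparison made concrete.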
The concrete demonstration: Chinchilla (70 billion parameters, 1.4 trillion tokens) outperformed Gopher (280 billion parameters, 300 billion tokens) despite using roughly the same training compute. Gopher was undertrained by Chinchilla’s prescription — its large parameter count was not matched by sufficient data. Subsequent large models (Llama, Mistral) adopted Chinchilla-optimal or data-intensive recipes and achieved strong performance at smaller sizes than their predecessors.
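The compute parity behind this comparison can be sanity-checked with the standard approximation for dense transformer training cost, C ≈ 6ND (the approximation is background knowledge, not a figure from the paper):

```python
# Sanity-checking "roughly the same training compute" with the standard
# C ≈ 6*N*D approximation for dense transformer training FLOPs.
def train_flops(params, tokens):
    return 6 * params * tokens

chinchilla = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs
gopher     = train_flops(280e9, 300e9)   # ~5.0e23 FLOPs
print(f"Chinchilla: {chinchilla:.1e} FLOPs")
print(f"Gopher:     {gopher:.1e} FLOPs")
```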
The disagreement between Kaplan and Chinchilla partly reflects methodology (later analyses attribute much of the gap to learning-rate schedules that Kaplan et al. did not tune to each token budget) and partly reflects what "optimal" means under different constraints. Compute-optimal training is not the same as cost-optimal deployment: once inference matters, a smaller model trained longer is cheaper to run in production, which favors training past the compute-optimal point.
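A toy lifetime-cost model shows why inference pushes the optimum toward smaller models. The serving volume, the assumption that the two models reach comparable quality, and the 2N-FLOPs-per-token inference figure are all illustrative; the 15 trillion token count is in the range used by recent small, heavily trained models:

```python
# Toy lifetime-cost model: training (6*N*D) plus inference (2*N per token).
# Serving volume and the quality-parity assumption are illustrative only.
def lifetime_flops(params, train_tokens, served_tokens):
    train = 6 * params * train_tokens      # standard training approximation
    serve = 2 * params * served_tokens     # forward passes during deployment
    return train + serve

served = 1e13                              # hypothetical tokens served in production
big   = lifetime_flops(70e9, 1.4e12, served)   # Chinchilla-style recipe
small = lifetime_flops(8e9, 15e12, served)     # smaller model, far more tokens
print(f"70B model: {big:.2e} lifetime FLOPs")
print(f" 8B model: {small:.2e} lifetime FLOPs")
```

At this serving volume the smaller model wins on total FLOPs even though its training run is more expensive, which is exactly the pressure that has pushed recent recipes past the Chinchilla-optimal point.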
Emergent Capabilities
The scaling laws describe a smooth decrease in loss with compute. But the capabilities that users interact with — coding, reasoning, translation, instruction-following — are not smooth. They appear to be absent below some threshold and present above it, with transitions that look discontinuous on a linear scale.
This is emergence in the technical sense used by Wei et al. (2022): a capability is emergent if it is not present in smaller models and appears in larger ones in a way that could not have been predicted by extrapolating from smaller-scale performance. The standard examples are few-shot arithmetic (solving math problems from examples), chain-of-thought reasoning (showing intermediate steps), and multi-step commonsense inference.
Several caveats matter. First, what looks like a phase transition often depends on the metric. If you measure the fraction of problems answered exactly correctly, you see a sharp threshold. If you measure a softer metric that gives partial credit, the improvement is gradual. Some “emergent” capabilities may be threshold effects in measurement, not phase transitions in the model.
Second, emergence relative to model size depends on the benchmark. A capability that appears emergent when measured against model parameters might look smooth when measured against training compute, or might appear at different thresholds in different architectures. The apparent discontinuities may be partly an artifact of the granularity of model sizes tested.
Third, and most interestingly, some capabilities that seemed emergent in 2022 are now predictable from smaller models using better benchmarks. Schaeffer et al. (2023) argued that emergence is a mirage — a consequence of nonlinear metrics applied to smooth underlying model behavior. This view is contested; the mechanistic question of whether there are genuine phase transitions in how the model processes information, as opposed to smooth capability improvements measured by thresholded metrics, is not settled.
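The first and third caveats can be reproduced in a few lines. In the sketch below, per-token accuracy improves smoothly with scale (the logistic curve is an assumed stand-in for the real trend), yet exact-match accuracy on a ten-token answer, which requires every token to be correct, looks like a sudden jump:

```python
import numpy as np

# Smoothly improving per-token accuracy (assumed logistic trend, for illustration).
scale = np.linspace(0, 10, 11)                  # abstract scale axis
p_token = 1 / (1 + np.exp(-(scale - 5)))

# Exact match on a k-token answer requires all k tokens to be correct.
k = 10
exact_match = p_token ** k

for s, p, em in zip(scale, p_token, exact_match):
    print(f"scale={s:4.1f}  per-token={p:.3f}  exact-match={em:.5f}")
```

The underlying curve never jumps; only the thresholded metric does. This is the shape of Schaeffer et al.'s argument, though it does not by itself settle the mechanistic question.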
The Compute Trajectory
The scaling laws have an uncomfortable implication for anyone thinking about AI development trajectories. If capability improvements scale predictably with compute, and compute continues on its historical trajectory (roughly 4x per year in training runs for frontier models), then the performance of frontier systems will continue improving at a predictable rate, barring architectural breakthroughs or data limits.
The data wall is the nearest concrete limit. The scaling laws assume that more training tokens are available. The total stock of high-quality text on the internet is finite — estimates range from 5 trillion to 100 trillion tokens of quality text that might be suitable for training. Models trained on Chinchilla-optimal recipes are now consuming a substantial fraction of this. Synthetic data generation (using AI systems to produce training data for future AI systems) is one proposed solution; its quality, diversity, and the feedback loops it creates are active research questions.
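To get a feel for how quickly Chinchilla-style training consumes the stock, here is a back-of-envelope sketch using the rough rules of thumb D ≈ 20N and C ≈ 6ND (both are approximations, not figures from the estimates above):

```python
import math

def chinchilla_tokens(compute):
    """Tokens implied by C ≈ 6*N*D with D ≈ 20*N, i.e. C ≈ 120*N^2."""
    n_params = math.sqrt(compute / 120)
    return 20 * n_params

for c in (1e24, 1e25, 1e26):
    print(f"C={c:.0e} FLOPs -> ~{chinchilla_tokens(c):.1e} training tokens")
# A 1e25 FLOP run already implies ~6e12 tokens, above the low-end estimate
# of the quality-text stock; 1e26 implies ~2e13.
```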
The compute constraint is less binding at current prices but is subject to hardware development, energy availability, and economic limits. Training GPT-4-class models required on the order of 10²⁴ to 10²⁵ FLOPs. If the scaling laws continue, extending the trend by another order of magnitude in compute means training runs of roughly 10²⁵ to 10²⁶ FLOPs, feasible in principle at projected hardware costs by the late 2020s.
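The timing arithmetic is simple enough to write down, assuming the roughly 4x-per-year trend quoted above holds:

```python
import math

growth_per_year = 4.0                       # assumed frontier-compute growth rate
years_per_10x = math.log(10) / math.log(growth_per_year)
print(f"~{years_per_10x:.2f} years per order of magnitude")        # ~1.66
print(f"10^25 -> 10^26 FLOPs in about {years_per_10x:.1f} years at this rate")
```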
What the Scaling Laws Don’t Explain
The scaling laws are predictive engineering tools, not mechanistic theories. They describe what happens at scale; they don’t explain why. Why does more compute produce lower loss? Because gradient descent over a larger hypothesis space with more data finds better representations. Why does the relationship take a power-law form? No one has a fully satisfying answer derived from first principles. The regularity is observed; its theoretical basis is underdetermined.
The scaling laws also don’t predict which capabilities will emerge at which scale, or whether capability improvements on standard benchmarks will translate to capabilities in deployment contexts. A model with lower perplexity on held-out text is not necessarily more useful for a specific application. The relationship between the loss metric the scaling laws track and the downstream task performance users care about is indirect.
The most important open question the scaling laws raise: do they continue? Every previously proposed limit to neural network scaling has been exceeded. The pessimistic case — that scaling will hit diminishing returns or hard limits at model sizes beyond current training runs — has been wrong repeatedly. The optimistic case — that scaling alone will continue producing qualitative capability improvements for another several orders of magnitude — is unconfirmed. The honest answer is that we don’t know, and the history of AI forecasting suggests high uncertainty is appropriate.