The Transformer Architecture
The 2017 paper "Attention Is All You Need" introduced an architecture that replaces recurrence with a mechanism relating every token to every other token simultaneously. Eight years later, transformers underlie almost all frontier AI. Understanding why requires understanding what attention actually computes.
What Came Before
Before transformers, sequence models were dominated by Recurrent Neural Networks — LSTMs and GRUs in particular. The recurrent architecture processes sequences one token at a time: at each step, the network updates a hidden state based on the current input and the previous hidden state. The hidden state is supposed to carry all relevant information from previous positions forward.
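A minimal NumPy sketch of that recurrent update, to make the "one token at a time" structure concrete (the names xs, h, W_xh, W_hh are illustrative, not from any particular implementation):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    """Vanilla RNN: process tokens one at a time, carrying a fixed-size hidden state.

    xs: (seq_len, d_in) token embeddings; W_xh: (d_in, d_h); W_hh: (d_h, d_h); b: (d_h,)
    """
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in xs:                              # strictly sequential: step t needs step t-1's h
        h = np.tanh(x_t @ W_xh + h @ W_hh + b)  # all prior context must fit into h
        outputs.append(h)
    return np.stack(outputs)
```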
The problem is the information bottleneck. All context from arbitrarily long sequences has to be compressed into a fixed-size hidden state vector. Long-range dependencies — where a word or phrase early in a document is necessary to correctly interpret something much later — are difficult to preserve across many sequential steps. In practice, RNNs struggled with sequences longer than a few hundred tokens. LSTMs and GRUs helped with gating mechanisms that allowed the network to selectively remember and forget, but the fundamental bottleneck remained.
A second problem: sequential computation is slow. Each step depends on the previous step’s hidden state, so the forward pass cannot be parallelized across the sequence. Training on long sequences was slow, and modern hardware (GPUs, TPUs) is built for parallelism.
Both problems are resolved by the transformer.
Self-Attention: The Core Mechanism
The transformer’s key innovation is the self-attention mechanism. Instead of processing tokens one at a time with a hidden state, self-attention relates every token in the sequence to every other token simultaneously. For a sequence of n tokens, self-attention computes an n × n matrix of attention weights, where each entry (i, j) encodes how much position i should attend to position j.
The computation uses three projections of each token’s embedding: Query (Q), Key (K), and Value (V). For each token, its query is compared against every token’s key (including its own) by dot product, scaled by the square root of the key dimension, and normalized through softmax to produce attention weights that sum to 1. The output for each token is then the weighted sum of all tokens’ values, weighted by these attention scores.
Formally: Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
The intuition: each token asks a query (“what am I looking for?”), every other token presents a key (“what do I contain?”), and the match between a query and a key determines how much that token’s value is incorporated into the output. A verb asking what its subject is will find high attention weights on nearby nouns that match its syntactic expectations. A pronoun will attend to its antecedent. A token at the end of a long document can directly attend to a relevant token at the beginning, with no intermediate steps degrading the signal.
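A minimal NumPy sketch of the formula above for a single attention head, without masking or batching (function and variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention. X: (n, d_model); W_q/W_k/W_v: (d_model, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (n, n): every query against every key
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # weighted sum of value vectors
```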
Self-attention costs O(n²) in memory and computation — every pair of tokens must be compared. For short sequences this is fine; for very long sequences (100k+ tokens) it is a significant constraint that has driven substantial research into efficient attention variants (sparse attention, linear attention approximations, sliding window attention).
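A back-of-the-envelope illustration of that quadratic cost, counting only the raw attention scores for one head in one layer (real implementations such as FlashAttention avoid materializing the full matrix):

```python
n = 100_000                    # sequence length in tokens
scores = n * n                 # one attention score per (query, key) pair
print(scores)                  # 10,000,000,000 entries
print(scores * 2 / 1e9, "GB")  # ~20 GB at 2 bytes per fp16 score
```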
Multi-Head Attention
A single attention head computes one pattern of attention over the sequence. Multi-head attention runs h attention heads in parallel, each with its own Q, K, V projections and therefore its own learned attention pattern. The outputs of all heads are concatenated and projected back to the model dimension.
The intuition: different heads can attend to different aspects of the sequence simultaneously. One head might capture syntactic relationships; another might capture semantic similarity; another might track long-range coreference. The model learns to allocate different types of relational reasoning to different heads. Empirical probing studies have found interpretable attention patterns in trained models — heads that specialize in detecting subject-verb agreement, heads that track pronouns to their antecedents, heads that detect positional relationships.
In practice, the specialization is messier than this suggests. Many heads appear redundant; ablation studies show that most heads can be removed with small performance loss. The multi-head structure seems to be more about providing diverse initial representations for the model to select from than about strict specialization.
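A sketch of the multi-head wrapper, reusing the single-head self_attention function from the earlier sketch (shapes and names are illustrative; production code fuses the heads into batched matrix multiplies):

```python
def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head; W_o: (h * d_k, d_model)."""
    per_head = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    concat = np.concatenate(per_head, axis=-1)  # (n, h * d_k)
    return concat @ W_o                         # project back to the model dimension
```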
The Full Architecture
A transformer model stacks transformer blocks, each consisting of a multi-head self-attention layer followed by a position-wise feedforward network, with residual connections and layer normalization around each sub-layer.
The feedforward network in each block is two linear layers with a nonlinearity between them, applied independently to each position. Its width (typically 4x the model dimension) is where much of the model’s parameter count lives. The feedforward layers are widely interpreted as storing factual associations (the “knowledge” in a language model), in contrast to the attention layers, which handle relational reasoning and context integration.
Residual connections — adding the layer’s input to its output before the next layer — are essential for training very deep networks. They give gradients a direct path backward through the network, mitigating the vanishing gradient problem. Layer normalization stabilizes the activations at each layer, making training more reliable.
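A sketch of one full block under these conventions, using the post-norm ordering of the original paper and a ReLU feedforward (layer_norm here omits the learned scale and bias; multi_head_attention is the function from the sketch above):

```python
def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: two linear layers, typically widening to 4 * d_model in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(X, attn_params, ffn_params):
    heads, W_o = attn_params
    W1, b1, W2, b2 = ffn_params
    # Attention sub-layer: residual connection, then normalization (post-norm, as in 2017).
    X = layer_norm(X + multi_head_attention(X, heads, W_o))
    # Feedforward sub-layer, applied independently at each position, same residual pattern.
    X = layer_norm(X + feed_forward(X, W1, b1, W2, b2))
    return X
```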
Because self-attention has no inherent notion of position — it computes relations between all pairs of tokens regardless of their distance — positional information must be explicitly injected. The original transformer used sinusoidal positional encodings added to token embeddings. Most modern large language models use rotary positional embeddings (RoPE), which encode relative offsets by rotating query and key vectors and extend to longer context windows more gracefully; others use learned absolute positional embeddings.
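A sketch of the original sinusoidal encoding, which is added to the token embeddings before the first block (RoPE works differently, rotating query and key vectors inside the attention computation); d_model is assumed even:

```python
import numpy as np

def sinusoidal_positions(n, d_model):
    """Original transformer positional encoding: sine on even dimensions, cosine on odd."""
    positions = np.arange(n)[:, None]                 # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    enc = np.zeros((n, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc                                        # added elementwise to the embeddings
```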
Why Transformers Work So Well at Scale
The transformer architecture turns out to have a property that was not fully understood when it was proposed: it scales extremely well. Increasing model size (parameters), data quantity, and compute consistently improves performance on language tasks, with no obvious ceiling reached through 2024. Recurrent architectures did not exhibit this clean scaling behavior.
The architectural reasons are partly understood. Self-attention is an expressive operation — it can compute, in a single layer, any weighted combination of the input sequence, weighted by arbitrary learned similarities. The feedforward layers provide ample capacity for storing associations. The residual structure makes optimization tractable at depth. Together, these properties create a model class where more compute invested in training reliably translates to better-learned representations.
The scaling behavior also depends on data. Transformers trained on text are implicitly learning the statistical structure of human language — the syntactic patterns, semantic relationships, factual associations, and reasoning patterns present in the training corpus. The expressiveness of the architecture means that more data and compute yield more of this structure being captured.
Attention Beyond Language
Transformers have generalized far beyond text. Vision Transformers (ViT) apply the transformer architecture to images by splitting the image into patches, treating each patch as a token, and running standard transformer self-attention over the sequence of patches. They match or exceed convolutional networks on image classification when trained at sufficient scale.
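A sketch of that patch-to-token step (illustrative only; the actual ViT additionally applies a learned linear projection to each flattened patch and prepends a class token before adding position embeddings):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Each patch becomes one 'token' of length patch_size * patch_size * C,
    ready for a linear projection into the model dimension.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)        # group by patch: (H/p, W/p, p, p, C)
    return patches.reshape(-1, patch_size * patch_size * C)
```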
Protein structure prediction (AlphaFold2, 2021) uses attention at its core to represent relationships between amino acid residues in a protein sequence and infer the three-dimensional structure from those relationships. The accuracy improvement was discontinuous — AlphaFold2 essentially solved the single-chain structure prediction problem that had resisted fifty years of computational biology. Its central module, the Evoformer, is built from attention over residue pairs and the multiple sequence alignment, trained primarily on experimentally determined structures.
The pattern suggests that self-attention — the mechanism of relating every element of a structured input to every other element through learned similarity — is a general-purpose relational computation primitive, not a language-specific one. Wherever the input has structure that requires reasoning about relationships between parts, transformers are an applicable architecture.
What Attention Is and Isn’t Doing
The word “attention” invites anthropomorphization — the model is “paying attention” to relevant tokens, “focusing” on important context, “reading carefully.” The mechanism is more mechanical than this. Self-attention computes dot-product similarities between learned linear projections of token embeddings and uses those similarities as weights for a weighted sum. It is a differentiable, parallelized, learned lookup.
The interpretability of attention weights is contested. Early papers treated attention weights as explanations — if a token has high attention weight on another, the model is “using” that token to process this one. Subsequent work showed that high attention weights don’t necessarily indicate causal influence on the output, and that attention weights can be misleading as explanations. The actual computation the model performs cannot be read off the attention pattern alone.
This is the general limitation: the transformer is highly legible at the architectural level and nearly opaque at the computational level. We know what operations each layer performs in the abstract. We don’t know what algorithm the trained model implements in the concrete. That gap is what mechanistic interpretability is trying to close.