Diffusion Models — How Image Generation Actually Works
Stable Diffusion, DALL-E (from version 2 onward), and Midjourney are all diffusion models. The core idea: learn to reverse a noise-adding process. The mechanism is elegant and the results are surprising — but what the model has learned is not well understood.
The Idea in One Sentence
A diffusion model learns to reverse a process that gradually adds noise to data, until you can start from pure noise and run the reverse process to generate realistic data.
That’s the complete description of the generative mechanism. Everything else — the neural networks involved, the conditioning on text prompts, the various practical improvements — is engineering to make this idea work at scale.
The Forward Process
Training a diffusion model starts with a dataset of real images. To each image, the training procedure applies a forward diffusion process: a sequence of T steps that progressively adds Gaussian noise according to a fixed noise schedule. At step 0, the image is clean. At step T, the image is pure Gaussian noise — the original image content is completely destroyed.
The forward process is not learned. It’s fixed by design. The noise schedule (how much noise is added at each step) is chosen to ensure that by step T, the image is indistinguishable from random noise regardless of what the original image was. The math is tractable: given the original image x₀ and a step t, you can compute the noisy image xₜ directly without simulating all intermediate steps — a nice property that enables efficient training.
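To make that tractability concrete: with a DDPM-style schedule, xₜ has a closed form, xₜ = √ᾱₜ · x₀ + √(1 − ᾱₜ) · ε, where ᾱₜ is the cumulative product of the per-step signal retention. A minimal PyTorch sketch, with schedule values and names that are illustrative rather than taken from any particular model:

```python
import torch

# Illustrative DDPM-style linear noise schedule (typical values, not any
# specific model's).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # noise variance added per step
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative signal retention

def forward_noise(x0, t, noise=None):
    """Jump straight from x_0 to x_t without simulating intermediate steps:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)    # broadcast over (B, C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise, noise
```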
The Reverse Process: What the Network Learns
The reverse process is what the neural network learns. At each step t, given a noisy image xₜ, predict the noise that was added (or equivalently, predict the original clean image, or predict a “velocity” direction — different parameterizations are mathematically equivalent). Use this prediction to move from xₜ to a slightly less noisy xₜ₋₁. Repeat for T steps, starting from pure noise x_T, and you arrive at a clean image.
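Continuing the sketch above, the reverse loop looks like this. The `model(x, t)` call stands in for the trained network, and the update is the standard DDPM ancestral step, one of several valid samplers:

```python
@torch.no_grad()
def sample(model, shape):
    """Start from pure noise x_T and denoise step by step down to x_0.
    `model(x, t)` is assumed to predict the noise eps added at step t."""
    x = torch.randn(shape)                 # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)            # predicted noise at this step
        # Mean of the reverse transition p(x_{t-1} | x_t)
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                          # no noise added on the final step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```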
The network (typically a U-Net architecture, modified to handle time-step conditioning) is trained with a simple objective: given a randomly sampled time step t and a noisy version of a real image, predict the noise that was added. The loss is mean squared error between the predicted noise and the actual noise. This is a supervised learning problem — the labels (the actual noise) are known during training because you constructed the noisy images by adding noise you control.
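The whole training step fits in a few lines; a sketch using the `forward_noise` helper from above:

```python
import torch
import torch.nn.functional as F

def training_loss(model, x0):
    """Sample a random timestep per image, noise the batch with the
    closed-form forward process, and regress the network's prediction
    against the exact noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))
    x_t, eps = forward_noise(x0, t)        # the "label" is noise we control
    eps_pred = model(x_t, t)               # U-Net conditioned on timestep t
    return F.mse_loss(eps_pred, eps)
```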
The loss is simple; the results are surprisingly rich. The network is trained to denoise images at all noise levels simultaneously, which forces it to develop a hierarchical understanding of image structure — at high noise levels (early steps in the reverse process), the network makes broad decisions about overall composition and structure; at low noise levels (late steps), it refines fine details.
Classifier-Free Guidance and Text Conditioning
An unconditional diffusion model generates random images from the training distribution. To generate images matching a text description, the model needs to condition its denoising on the text.
The approach used in Stable Diffusion and similar models: encode the text prompt into an embedding using a pre-trained text encoder (CLIP or similar), then condition the U-Net denoising network on this embedding at every step. Cross-attention between the image features and the text embedding allows the network to attend to relevant parts of the text when generating different parts of the image.
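A sketch of that conditioning mechanism, assuming flattened spatial features as queries and CLIP-style text token embeddings as keys and values (the dimensions are illustrative, not Stable Diffusion's actual sizes):

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Image features attend to the text embedding sequence, so each
    spatial location can pull in the prompt tokens relevant to it."""
    def __init__(self, img_dim=320, txt_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(img_dim, heads,
                                          kdim=txt_dim, vdim=txt_dim,
                                          batch_first=True)

    def forward(self, img_tokens, text_emb):
        # img_tokens: (B, H*W, img_dim) flattened spatial features
        # text_emb:   (B, seq_len, txt_dim) encoded prompt tokens
        out, _ = self.attn(img_tokens, text_emb, text_emb)
        return out
```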
Classifier-free guidance (CFG) is the technique that makes text-conditional generation sharp and coherent. The model is trained jointly as a conditional model (given the text, predict the noise) and an unconditional model (no text, predict the noise). At generation time, the predicted noise is a linear combination: (1 + γ) · conditional prediction − γ · unconditional prediction, where γ is the guidance scale. High guidance scale produces images that strongly match the text prompt but sacrifice diversity and naturalness. Low guidance scale produces more diverse but less prompt-adherent images. CFG is why generated images look “over-sharp” or “painterly” at high guidance — it’s amplifying the conditional signal beyond what training samples would naturally exhibit.
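The guidance combination itself is one line. A sketch, where `model` is a text-conditioned noise predictor and `null_emb` is the embedding of the empty prompt (the names and the default scale are illustrative):

```python
import torch

@torch.no_grad()
def guided_eps(model, x, t, text_emb, null_emb, gamma=6.5):
    """Classifier-free guidance: run the network with and without the text,
    then extrapolate past the conditional prediction by the guidance scale."""
    eps_cond = model(x, t, text_emb)       # conditional prediction
    eps_uncond = model(x, t, null_emb)     # unconditional prediction
    return (1 + gamma) * eps_cond - gamma * eps_uncond
```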
The Latent Diffusion Trick
Running diffusion in pixel space is expensive — images have many pixels, each with multiple channels, and running hundreds of denoising steps at this resolution is computationally prohibitive for high-resolution generation.
Latent diffusion models (the architecture underlying Stable Diffusion) apply the diffusion process in a compressed latent space. A variational autoencoder (VAE) is trained separately to compress images into a lower-dimensional latent representation (typically 4x or 8x smaller per spatial dimension; Stable Diffusion, for example, maps 512×512 images to 64×64 latents with 4 channels). The diffusion process then operates in this latent space rather than pixel space — dramatically reducing computation. The final step is to decode the generated latent back to pixel space using the VAE decoder.
The VAE is the component that handles fine pixel-level detail; the diffusion model handles semantic and compositional structure in the latent space. This separation of concerns is what makes high-resolution generation tractable on consumer hardware.
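Putting the pieces together, a latent diffusion generation loop is the pixel-space loop from earlier run on a small latent, with a single decode at the end. In this sketch the `vae` and `unet` objects are placeholders, and the 4×64×64 latent shape matches a 512×512 image at 8x downsampling:

```python
import torch

@torch.no_grad()
def generate(vae, unet, text_emb, latent_shape=(1, 4, 64, 64)):
    """Denoise in latent space, then decode once to pixels."""
    z = torch.randn(latent_shape)          # noise in the VAE's latent space
    for t in reversed(range(T)):
        t_batch = torch.full((latent_shape[0],), t, dtype=torch.long)
        eps = unet(z, t_batch, text_emb)   # text-conditioned denoising
        z = (z - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return vae.decode(z)                   # single cheap decode to pixels
```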
CLIP and the Text-Image Embedding Space
CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) is a model trained to align image and text embeddings using a contrastive objective on 400 million image-text pairs from the internet. The training signal: the embedding for an image and its caption should be similar; the embeddings for an image and a random caption should be dissimilar.
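The objective is compact in code. A sketch, assuming a batch where row i of the image embeddings is paired with row i of the text embeddings:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: matched image-caption pairs sit on the
    diagonal of the similarity matrix; every other pairing is a negative."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature   # cosine similarities
    targets = torch.arange(logits.shape[0])      # diagonal = correct pair
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```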
The result is an embedding space shared between images and text — where semantically related images and descriptions are geometrically close. CLIP embeddings are the most common text conditioning mechanism in diffusion models: the prompt is encoded by CLIP’s text encoder, and the diffusion model learns to produce images whose CLIP embedding matches the prompt embedding.
The CLIP embedding space has semantic structure similar to word embeddings: directions in the space encode concepts, and arithmetic works (the embedding for “a photo of a cat wearing a hat” is interpretably close to the combination of “cat,” “photo,” and “hat” directions). The diffusion model doesn’t generate images directly from prompts — it generates images consistent with the prompt’s position in the CLIP embedding space.
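You can probe this structure directly with the public CLIP checkpoint; a small example, assuming the Hugging Face transformers library is installed:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a cat wearing a hat", "a photo of a cat",
           "a hat", "a diagram of a jet engine"]
with torch.no_grad():
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)
print(emb @ emb.T)   # related prompts score noticeably higher than unrelated
```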
What the Model Has Actually Learned
This is the part that remains genuinely unclear. A diffusion model trained on 5 billion image-caption pairs can generate photorealistic images of arbitrary scenes, styles, and compositions. It can generate coherent hands (mostly), accurate text in images (sometimes), and consistent lighting across complex scenes. It can generalize to prompts describing things not in its training set by combining concepts it has seen separately.
What internal representation is supporting this? The U-Net at each denoising step is doing something — computing some features of the current noisy image, the text embedding, and the noise level — that produces a useful noise prediction. But what those features are, what concepts they encode, and how they combine to produce compositionally coherent images is not well characterized.
Some things are known. The attention maps in the U-Net at intermediate layers correlate with semantic regions of the image — the attention to the word “dog” in the prompt is high in the spatial region where the dog appears. Different noise levels correspond to different levels of abstraction — high-noise denoising determines composition, low-noise denoising determines texture and fine detail.
But the full computational story — how the model represents the concept “a medieval castle at sunset” and how it translates that representation into specific pixel decisions — is not known. This is the same opacity problem as discriminative models, now applied to generative ones.
The Failure Modes Reveal the Seams
Where diffusion models fail is informative. Generating multiple distinct objects with specific relationships (“a red ball to the left of a blue cube”) is unreliable — the model has not fully decomposed scenes into object-level representations with explicit spatial relations. Generating text within images is error-prone (recent models have improved significantly here). Generating coherent hands with the right number of fingers was a persistent failure mode, improved in newer models through additional training data and fine-tuning.
These failures suggest that the model represents images primarily through texture, style, and holistic semantic content rather than through explicit object models with discrete attributes. The “understanding” is distributional rather than structured. The model has learned what scenes look like, in a statistical sense, without necessarily learning a compositional model of what scenes are made of.
This is consistent with the Bitter Lesson’s direction — the model learned from data rather than from an explicitly structured representation — and carries the same limitation: statistical competence without explicit structure can fail at systematic generalization tasks that require reasoning about discrete entities and their relationships.