Diffusion Models
How gradually adding and then reversing noise teaches a neural network to generate data — from the forward Markov chain to classifier-free guidance and latent diffusion.
What if generating an image was nothing more than learning to clean up noise?
Diffusion models are built on exactly this idea. Rather than learning a generator from scratch, they learn the reverse of a simple, known process: gradually adding Gaussian noise to data until it becomes pure static. A neural network trained to undo this noise one small step at a time turns out to be an extraordinarily powerful generative model.
First proposed in a probabilistic framework by Sohl-Dickstein et al. (2015) and brought to state-of-the-art image quality by DDPM (Ho et al., 2020), diffusion models now power Stable Diffusion, DALL-E 2, Midjourney, Imagen, and Sora.
1. The Core Intuition
Imagine taking a photograph and gradually blurring and adding static to it — first a little, then more and more — until after enough steps it looks like pure random noise. This is the forward process: its distribution is fixed and known, it is easy to simulate, and it completely destroys the information in the original image.
Now reverse it. If you could learn to take a noisy image and predict what a slightly less noisy version would look like, you could run that operation hundreds of times — starting from pure noise and ending at a realistic image. This is the reverse process, and it is what a diffusion model learns.
The key insight: the forward process is fixed and known. We do not learn it. We only learn the reverse. And because each denoising step is small, the reverse distribution at each step is approximately Gaussian — making it easy to parameterise with a neural network.
2. The Forward Process
The forward process is a fixed Markov chain that adds a small amount of Gaussian noise at each of $T$ steps. Given a data point $x_0$, successive noisy versions $x_1, \dots, x_T$ are sampled as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$

At each step, the previous sample is slightly shrunk (by $\sqrt{1-\beta_t}$) and a small amount of noise (with variance $\beta_t$) is added. The sequence $\beta_1, \dots, \beta_T$ is the noise schedule — a fixed, pre-chosen set of small positive numbers.
The Closed-Form Forward Sample
The elegance of this Markov chain is that you can sample $x_t$ at any arbitrary step $t$ directly from $x_0$, without simulating all intermediate steps. Defining:

$$\alpha_t = 1 - \beta_t, \qquad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s,$$

the marginal at step $t$ is:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\,\mathbf{I}\right)$$

This means any noisy sample can be written as a linear combination of the clean image $x_0$ and a standard Gaussian noise vector $\epsilon \sim \mathcal{N}(0, \mathbf{I})$:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$$
| Term | Meaning |
|---|---|
| $\sqrt{\bar\alpha_t}$ | Signal coefficient — how much of the original image survives |
| $\sqrt{1-\bar\alpha_t}$ | Noise coefficient — how much noise has been added |
| $\bar\alpha_t \to 0$ as $t \to T$ | At the final step, the image is pure noise |
Why This Is So Useful
This closed-form means training is trivially parallelised: for any data point $x_0$ and any step $t$, we can immediately compute a noisy sample $x_t$ in one shot. No need to simulate the chain step by step.
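As a concrete sketch in PyTorch (the names `q_sample`, `betas`, `alpha_bars` and the linear schedule values are illustrative, not from any particular library):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # DDPM-style linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in one shot using the closed-form marginal."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over batch dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# Usage: corrupt a batch of images at random timesteps
x0 = torch.randn(8, 3, 64, 64)               # stand-in for a batch of real images
t = torch.randint(0, T, (8,))
x_t = q_sample(x0, t)
```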
3. The Noise Schedule
The schedule $\{\beta_t\}$ controls how quickly information is destroyed. Two common choices:
Linear Schedule (DDPM)
The noise variances $\beta_t$ increase linearly from a small starting value to a larger final one. Simple but suboptimal — it destroys signal too quickly at the start, wasting many steps on nearly-pure noise.
Cosine Schedule (Improved DDPM)
Defines $\bar\alpha_t = \frac{f(t)}{f(0)}$ with $f(t) = \cos^2\!\big(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\big)$ for a small offset $s$. This keeps $\bar\alpha_t$ near 1 for longer, then drops it more smoothly — the network spends more steps in the perceptually meaningful regime and fewer steps on nearly-pure noise.
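A small sketch comparing how quickly each schedule destroys signal, assuming the standard linear endpoints and the cosine form above:

```python
import math
import torch

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """\\bar{alpha}_t under a linear beta schedule."""
    return torch.cumprod(1.0 - torch.linspace(beta_start, beta_end, T), dim=0)

def cosine_alpha_bar(T=1000, s=0.008):
    """\\bar{alpha}_t under the Improved-DDPM cosine schedule."""
    t = torch.arange(T + 1) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:]

# The linear schedule reaches near-zero signal much earlier than the cosine one
for step in (0, 250, 500, 750, 999):
    print(step, round(float(linear_alpha_bar()[step]), 4),
                round(float(cosine_alpha_bar()[step]), 4))
```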
4. The Reverse Process
The forward process is fixed and known. The reverse is what we learn.
The true reverse — the distribution of $x_{t-1}$ given $x_t$ — is intractable because it requires knowing the entire data distribution. However, if the forward steps are small enough, the reverse is approximately Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\right)$$

We parameterise the mean $\mu_\theta(x_t, t)$ with a neural network. Generation then proceeds as:

$$x_{t-1} \sim p_\theta(x_{t-1} \mid x_t),$$

starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$ and iterating down to $x_0$.
The Posterior Given $x_0$
The key mathematical fact: when conditioned on the clean image $x_0$, the reverse distribution is tractable and Gaussian:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t \mathbf{I}\right), \qquad \tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, x_t, \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$$
This posterior will be the target for our training objective.
"The network predicts ε at each step, which we subtract to recover the structured signal."
5. The Training Objective
From ELBO to a Simple Loss
Like VAEs, the training objective is the ELBO on the log-likelihood $\log p_\theta(x_0)$. After expanding through the Markov chain and simplifying, the ELBO decomposes into:

$$\mathcal{L} = L_T + \sum_{t=2}^{T} L_{t-1} + L_0,$$

where each $L_{t-1} = D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)$ is a KL divergence between the true reverse posterior and the learned reverse step.

Since both are Gaussian, the KL has a closed form — it reduces to the squared difference between their means. Substituting the closed-form expression for $\tilde\mu_t$ and reparameterising in terms of the noise $\epsilon$, Ho et al. showed that a simplified loss works better in practice:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\Big[\big\|\,\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\|^2\Big]$$

In plain English: sample a clean image $x_0$, a random noise vector $\epsilon$, and a random timestep $t$; corrupt the image using the closed-form forward equation; then train the network $\epsilon_\theta$ to predict the noise that was added.
This is just a mean-squared error regression. No adversarial training, no latent variable inference — just supervised regression on noise prediction.
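Put together, a minimal training step looks like the following sketch (reusing `q_sample` and the schedule tensors from the earlier snippet; `model` stands for any network mapping `(x_t, t)` to a noise prediction):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0):
    """One DDPM training step: corrupt the image, then regress the added noise with MSE."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                  # the noise we will predict
    x_t = q_sample(x0, t, noise=eps)                            # closed-form forward sample
    return F.mse_loss(model(x_t, t), eps)
```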
Three Equivalent Prediction Targets
A diffusion network can equivalently predict:
| Target | Formula | Notes |
|---|---|---|
| Noise | $\epsilon_\theta(x_t, t)$ | Ho et al. (DDPM) — default |
| Clean image | $\hat{x}_0 = \big(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta\big)/\sqrt{\bar\alpha_t}$ | Deduced from noise prediction |
| Score | $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)\,/\,\sqrt{1-\bar\alpha_t}$ | Song et al. — score matching view |
All three are mathematically equivalent and are related by simple linear transforms.
6. The Denoising Network: U-Net
The network takes a noisy image and a timestep as input and predicts the noise. The architecture of choice is the U-Net, originally designed for biomedical image segmentation.
Why U-Net?
A good noise predictor needs two properties:
- Global context — knowing the overall structure of the image to predict coherent noise.
- Fine-grained spatial output — the output is the same size as the input (pixel-level noise map).
U-Net satisfies both via an encoder–decoder structure with skip connections:
Input x_t
│
[Conv] → [Downsample] → [Conv] → [Downsample] → ... → [Bottleneck]
│
[Conv] ← [Upsample] ← [Conv] ← [Upsample] ← ... ←────────┘
│ ↑ ↑
└── skip ─┘ └── skip ────────┘
│
Output ε̂ (same shape as x_t)

Skip connections concatenate encoder feature maps directly into the decoder, preserving fine spatial detail that would otherwise be lost through downsampling.
Timestep Conditioning
The timestep $t$ is injected into every residual block via a sinusoidal embedding (identical in form to Transformer positional embeddings):

$$\mathrm{emb}(t)_{2i} = \sin\!\big(t / 10000^{2i/d}\big), \qquad \mathrm{emb}(t)_{2i+1} = \cos\!\big(t / 10000^{2i/d}\big)$$
This embedding is projected through a small MLP and added (or concatenated) to the feature maps at each resolution, telling the network which noise level it is operating at.
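A sketch of that embedding and its MLP projection (the dimensions are illustrative choices):

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps (same form as Transformer positions)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# Projected through a small MLP, then added to the feature maps of each ResBlock
time_mlp = nn.Sequential(nn.Linear(256, 1024), nn.SiLU(), nn.Linear(1024, 1024))
emb = time_mlp(timestep_embedding(torch.tensor([10, 500, 999]), 256))   # shape (3, 1024)
```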
Attention in U-Net
Modern diffusion U-Nets insert self-attention layers at intermediate resolutions (e.g., 16×16 and 8×8 feature maps). This allows the network to reason about global structure — e.g., ensuring both eyes of a face are consistent — without the full quadratic cost of attending over all pixels.
In short: skip connections restore the spatial high-frequencies lost during downsampling; the timestep embedding is injected into every ResBlock to scale activations based on noise level; and self-attention operates at 16×16 and 8×8 resolutions to capture global coherence.
7. DDPM Sampling
The full DDPM sampling algorithm, starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$$

where $z \sim \mathcal{N}(0, \mathbf{I})$ (or $z = 0$ at $t = 1$), and $\sigma_t^2 = \beta_t$.

Breaking down the update:

- Predict and remove noise: $\frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\big)$ — a "cleaned" estimate.
- Re-add stochastic noise: $\sigma_t z$ — ensures the sample stays on the correct marginal distribution.

With $T = 1000$ steps, DDPM produces excellent samples but is slow — a single image requires 1000 forward passes through the U-Net.
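In code, the loop is short (a sketch reusing the schedule tensors from earlier; `model` is the trained noise predictor):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape):
    """Ancestral DDPM sampling: start from pure noise and denoise one step at a time."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t, dtype=torch.long))
        mean = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z        # sigma_t^2 = beta_t (the simple DDPM choice)
    return x
```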
8. DDIM: Faster Deterministic Sampling
DDIM (Denoising Diffusion Implicit Models, Song et al., 2020) derives a non-Markovian reverse process that produces the same marginals as DDPM but can be sampled with far fewer steps.
The DDIM update rule, parameterised by $\eta \in [0, 1]$:

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\;\underbrace{\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}}_{\hat{x}_0} \;+\; \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) \;+\; \sigma_t z$$

where $\sigma_t = \eta\,\sqrt{\dfrac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\,\sqrt{1-\dfrac{\bar\alpha_t}{\bar\alpha_{t-1}}}$.
| $\eta$ | Behaviour |
|---|---|
| $\eta = 1$ | Equivalent to DDPM — fully stochastic |
| $\eta = 0$ | Fully deterministic — the same $x_T$ always gives the same $x_0$ |
Setting $\eta = 0$ gives a deterministic ODE sampler. Because the trajectory is smooth and deterministic, you can skip large chunks of the timestep sequence — sampling with only 50 or even 20 steps instead of 1000, with minimal quality loss.
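A sketch of the $\eta = 0$ sampler over a strided subset of timesteps, under the same assumptions as the earlier snippets:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, num_steps=50):
    """Deterministic DDIM (eta = 0): skip through a strided subset of the T steps."""
    steps = torch.linspace(T - 1, 0, num_steps).long()
    x = torch.randn(shape)
    for i, t in enumerate(steps):
        eps = model(x, torch.full((shape[0],), int(t), dtype=torch.long))
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else torch.tensor(1.0)
        x0_hat = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()       # predicted clean image
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps   # no stochastic noise term
    return x
```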
The DDIM ODE
In the limit of infinitesimally small steps, the deterministic reverse process is equivalent to solving an ODE defined by the neural network. This connects diffusion models to continuous normalizing flows — and means that any ODE solver (Euler, Heun, DPM-Solver) can be used for faster sampling.
By setting η=0, we convert the stochastic SDE into a deterministic ODE. This allows us to use larger step sizes with minimal error.
9. Score Matching: A Unified View
Parallel to the DDPM line of work, score-based generative models (Song & Ermon, 2019) approached the same idea differently — and the two frameworks turn out to be equivalent.
The Score Function
The score of a distribution $p(x)$ is the gradient of its log-density with respect to $x$:

$$s(x) = \nabla_x \log p(x)$$

The score points in the direction of increasing probability — it tells you which way to move a sample to make it more likely. Sampling can be done by following the score (Langevin dynamics):

$$x_{k+1} = x_k + \frac{\delta}{2}\,\nabla_x \log p(x_k) + \sqrt{\delta}\, z_k, \qquad z_k \sim \mathcal{N}(0, \mathbf{I})$$
Score Matching
We cannot compute $\nabla_x \log p(x)$ directly (we don't know $p(x)$). But denoising score matching shows that a network trained to predict the noise $\epsilon$ is implicitly estimating the score:

$$\nabla_{x_t} \log q(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}$$
Noise prediction = (scaled) score estimation. The DDPM training loss is a weighted sum of score matching objectives at all noise levels simultaneously.
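To make the equivalence concrete, here is a hedged sketch of a single Langevin update driven by the noise-prediction network (the step size is illustrative; `alpha_bars` is reused from the earlier schedule sketch):

```python
import torch

@torch.no_grad()
def langevin_step(model, x, t, delta=1e-4):
    """One Langevin update using the score implied by the noise predictor."""
    t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
    score = -model(x, t_batch) / (1 - alpha_bars[t]).sqrt()    # eps-to-score conversion
    return x + 0.5 * delta * score + (delta ** 0.5) * torch.randn_like(x)
```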
The Score Field
The score ∇ log p(x) points toward high-density regions. As noise σ increases, the field becomes smoother, providing gradients even in low-density "dead zones."
10. Conditional Generation
Unconditional diffusion generates random samples from the training distribution. Conditional generation produces samples matching a given condition (a text prompt, a class label, a reference image).
Classifier Guidance
Train the diffusion model unconditionally, plus a separate noise-aware classifier $p_\phi(y \mid x_t)$ that classifies noisy images at any timestep. At inference, modify the score with the classifier gradient:

$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \gamma\,\nabla_{x_t} \log p_\phi(y \mid x_t)$$

The scalar $\gamma$ is the guidance scale — higher values push samples more strongly toward the condition.
Classifier-Free Guidance (CFG)
Classifier guidance requires training a separate classifier. Classifier-Free Guidance (Ho & Salimans, 2021) eliminates this by training a single model for both conditional and unconditional generation using random conditioning dropout:
- During training, randomly drop the condition $c$ (replace it with a null token $\varnothing$) with a small fixed probability.
- The model learns both the conditional prediction $\epsilon_\theta(x_t, c)$ and the unconditional prediction $\epsilon_\theta(x_t, \varnothing)$.

At inference, the two predictions are combined:

$$\hat\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + \gamma\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big)$$
The guidance scale $\gamma$ controls the creativity vs. adherence trade-off:
| $\gamma$ | Effect |
|---|---|
| $\gamma = 1$ | No guidance — purely conditional model |
| moderate $\gamma$ | Standard use — diverse yet prompt-adherent |
| large $\gamma$ | Prompt-adherent but oversaturated, less diverse |
CFG is used in virtually every text-to-image model today — Stable Diffusion, DALL-E 2, Imagen, and Flux all rely on it.
At high γ, colors become too intense and diversity drops.
At low γ, the model is more "artistic" but follows prompts loosely.
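A minimal sketch of the combination at sampling time (the default scale of 7.5 is just a commonly used value; `cond` and `null_cond` stand for the text-embedding and null-token inputs):

```python
import torch

def cfg_noise(model, x_t, t, cond, null_cond, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, null_cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```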
11. Latent Diffusion Models
Running diffusion in pixel space is expensive — a 512×512 RGB image has ~786K dimensions. Latent Diffusion Models (LDM, Rombach et al., 2022 — the basis of Stable Diffusion) solve this by running diffusion in a compressed latent space.
Two-Stage Architecture
Stage 1 — Train a VQ-regularised Autoencoder:
A convolutional encoder $E$ compresses the image $x$ into a latent representation $z = E(x)$ — typically $64 \times 64 \times 4$ for a $512 \times 512 \times 3$ input (a $48\times$ compression). A decoder reconstructs the image: $\hat{x} = D(z)$. The autoencoder is trained with a perceptual loss and a patch-based GAN discriminator.
Stage 2 — Train a Diffusion Model in Latent Space:
The diffusion model operates entirely on the compressed latent $z$, not the full image. Sampling (a sketch follows this list):

- Sample $z_T \sim \mathcal{N}(0, \mathbf{I})$ in the latent space.
- Denoise to $z_0$ using the diffusion model (DDIM, ~50 steps).
- Decode: $\hat{x} = D(z_0)$.
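Sketched end to end, with a hypothetical `decoder` module and the `ddim_sample` helper from earlier:

```python
import torch

@torch.no_grad()
def generate(latent_model, decoder, batch=4):
    """Latent diffusion sampling: denoise in latent space, then decode to pixels."""
    z0 = ddim_sample(latent_model, shape=(batch, 4, 64, 64), num_steps=50)
    return decoder(z0)     # 64x64x4 latents -> 512x512x3 images
```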
Cross-Attention for Text Conditioning
Text prompts are encoded by a frozen CLIP or T5 text encoder into a sequence of token embeddings $c = (c_1, \dots, c_M)$. These are injected into the U-Net via cross-attention at every resolution:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V, \qquad Q = W_Q\,\varphi(z_t),\quad K = W_K\,c,\quad V = W_V\,c$$
The latent features attend over the text token embeddings, allowing every spatial location to selectively read from relevant parts of the prompt.
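A sketch of such a cross-attention layer using PyTorch's built-in multi-head attention (the dimensions are illustrative: 320-dim latent features, 768-dim CLIP-style token embeddings):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Latent features attend over text-token embeddings."""
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, z, text):            # z: (B, H*W, latent_dim), text: (B, M, text_dim)
        out, _ = self.attn(query=z, key=text, value=text)
        return out

layer = CrossAttention()
z = torch.randn(2, 64 * 64, 320)           # flattened latent feature map
text = torch.randn(2, 77, 768)              # e.g. CLIP token embeddings
print(layer(z, text).shape)                 # torch.Size([2, 4096, 320])
```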
In short: a VAE compresses each 512×512 image into a 64×64 latent grid, reducing dimensionality by roughly 48×; the U-Net only ever sees the 64×64 latent, which makes training and sampling dramatically faster and cheaper; and text embeddings are injected into the latent diffusion process via cross-attention layers.
12. Key Diffusion Model Families
| Model | Year | Key Contribution |
|---|---|---|
| DDPM | 2020 | Simplified training — just predict noise (MSE) |
| Improved DDPM | 2021 | Cosine schedule; learned variance |
| DDIM | 2021 | Deterministic non-Markovian sampling; 10–50× faster |
| Score SDE | 2021 | Continuous-time SDE formulation; unified framework |
| Classifier Guidance | 2021 | Use a classifier gradient to steer generation |
| DALL-E 2 | 2022 | CLIP image embeddings as conditioning |
| Latent Diffusion / SD | 2022 | Diffusion in VAE latent space; open-source |
| Imagen | 2022 | Cascaded pixel diffusion + large T5 text encoder |
| DiT | 2022 | Transformer (not U-Net) as the denoising backbone |
| SDXL | 2023 | Larger LDM with two text encoders; multi-aspect |
| Flow Matching | 2022–23 | Straight ODE trajectories; simpler training than diffusion |
| SD3 / Flux | 2024 | Multimodal Diffusion Transformer (MMDiT); flow matching |
13. The Diffusion Transformer (DiT)
While U-Net dominated the first generation of diffusion models, DiT (Peebles & Xie, 2022) replaced it with a pure Transformer backbone.
The image is split into patches, linearly projected into token embeddings, and processed by a standard Transformer with adaLN-Zero conditioning — the timestep and class label modulate the LayerNorm scale and shift parameters via a small MLP.
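A rough sketch of the adaLN-style modulation (the zero-initialised projection follows the paper's description; exact layer sizes and details are illustrative):

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Timestep/class conditioning modulates LayerNorm scale and shift (adaLN-Zero style)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))
        nn.init.zeros_(self.mlp[1].weight)     # "Zero": start as an identity-like modulation
        nn.init.zeros_(self.mlp[1].bias)

    def forward(self, tokens, cond):           # tokens: (B, N, dim), cond: (B, dim)
        shift, scale = self.mlp(cond).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale[:, None]) + shift[:, None]
```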
DiT scales more predictably than U-Net (FID improves smoothly with compute) and forms the backbone of Stable Diffusion 3 and Flux.
14. Comparing Diffusion to Other Generative Models
| Property | Diffusion | GAN | VAE | Normalizing Flow | Autoregressive |
|---|---|---|---|---|---|
| Sample quality (images) | Best | High | Medium | Low–Med | Medium |
| Sampling speed | Slow (many steps) | Fast | Fast | Fast | Slow (seq.) |
| Training stability | Excellent | Poor | Good | Good | Good |
| Likelihood | Bound (ELBO) | No | Bound | Exact | Exact |
| Mode coverage | Excellent | Often poor | Good | Good | Good |
| Conditional control | Excellent (CFG) | Moderate | Limited | Limited | Good |
| Latent space | Noisy | Implicit | Structured | Exact | No |
Diffusion models trade sampling speed for everything else — they dominate on quality, diversity, and conditional control, at the cost of requiring hundreds to thousands of network evaluations per sample. Techniques like DDIM, DPM-Solver, consistency models, and flow matching progressively close this speed gap.
Consistency Models
A newer paradigm (Song et al., 2023) trains a model to map any point on the diffusion trajectory directly to $x_0$, enabling single-step generation. Consistency distillation can produce near-DDPM quality in 1–4 steps, making real-time diffusion practical.
Summary
| Concept | Key Idea |
|---|---|
| Forward process | Fixed Markov chain: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)$ |
| Closed-form sample | $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ — skip to any step instantly |
| Reverse process | Learned Gaussian $p_\theta(x_{t-1} \mid x_t)$ — iterate from noise to data |
| Training objective | Predict the noise $\epsilon$: $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ |
| U-Net | Encoder–decoder with skip connections and timestep conditioning |
| DDIM | Deterministic ODE sampler — same quality in 50 steps instead of 1000 |
| Score matching | Noise prediction $=$ (negatively) scaled score — two views of the same model |
| Classifier-free guidance | $\hat\epsilon = \epsilon_\varnothing + \gamma(\epsilon_c - \epsilon_\varnothing)$ — fidelity vs. diversity knob |
| Latent diffusion | Run DDPM in VAE latent space — 48× fewer dimensions, same quality |
| DiT | Transformer backbone replacing U-Net — scales predictably with compute |