
Diffusion Models

How gradually adding and then reversing noise teaches a neural network to generate data — from the forward Markov chain to classifier-free guidance and latent diffusion.


What if generating an image was nothing more than learning to clean up noise?

Diffusion models are built on exactly this idea. Rather than learning a generator from scratch, they learn the reverse of a simple, known process: gradually adding Gaussian noise to data until it becomes pure static. A neural network trained to undo this noise one small step at a time turns out to be an extraordinarily powerful generative model.

First proposed in a probabilistic framework by Sohl-Dickstein et al. (2015) and brought to state-of-the-art image quality by DDPM (Ho et al., 2020), diffusion models now power Stable Diffusion, DALL-E 2, Midjourney, Imagen, and Sora.


1. The Core Intuition

Imagine taking a photograph and gradually blurring and adding static to it — first a little, then more and more — until after enough steps it looks like pure random noise. This is the forward process: deterministic in distribution, easy to simulate, and completely destroys the information in the original image.

Now reverse it. If you could learn to take a noisy image and predict what a slightly less noisy version would look like, you could run that operation hundreds of times — starting from pure noise and ending at a realistic image. This is the reverse process, and it is what a diffusion model learns.

The key insight: the forward process is fixed and known. We do not learn it. We only learn the reverse. And because each denoising step is small, the reverse distribution at each step is approximately Gaussian — making it easy to parameterise with a neural network.

[Interactive figure: sliding t from 0 to T mixes the clean image x₀ with noise ε (x_t = 1.00·x₀ + 0.00·ε at t = 0), ending in pure noise x_T as entropy increases.]

2. The Forward Process

The forward process is a fixed Markov chain that adds a small amount of Gaussian noise at each of $T$ steps. Given a data point $\mathbf{x}_0 \sim q(\mathbf{x})$, successive noisy versions are sampled as:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right)$$

At each step, the previous sample is slightly shrunk (by $\sqrt{1-\beta_t}$) and a small amount of noise (with variance $\beta_t$) is added. The sequence $\beta_1, \beta_2, \dots, \beta_T$ is the noise schedule — a fixed, pre-chosen set of small positive numbers.

The Closed-Form Forward Sample

The elegance of this Markov chain is that you can sample $\mathbf{x}_t$ at any arbitrary step $t$ directly from $\mathbf{x}_0$, without simulating all intermediate steps. Defining:

$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

the marginal at step $t$ is:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right)$$

This means any noisy sample can be written as a linear combination of the clean image and a standard Gaussian noise vector $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$\boxed{\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}}$$

| Term | Meaning |
| --- | --- |
| $\sqrt{\bar{\alpha}_t}$ | Signal coefficient — how much of the original image survives |
| $\sqrt{1-\bar{\alpha}_t}$ | Noise coefficient — how much noise has been added |
| $\bar{\alpha}_t \to 0$ as $t \to T$ | At the final step, the image is pure noise |

Why This Is So Useful

This closed-form means training is trivially parallelised: for any data point $\mathbf{x}_0$ and any step $t$, we can immediately compute a noisy sample $\mathbf{x}_t$ in one shot. No need to simulate the chain step by step.
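
As a minimal sketch, the one-shot forward sample can be written in a few lines. This assumes the linear DDPM schedule described below; the names (`T`, `betas`, `alpha_bar`, `q_sample`) are illustrative, not a specific library's API.

```python
# Closed-form forward sample: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # β_1 ... β_T (linear schedule)
alphas = 1.0 - betas                         # α_t = 1 - β_t
alpha_bar = torch.cumprod(alphas, dim=0)     # ᾱ_t = Π_{s≤t} α_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t directly from x_0 for a batch of (0-indexed) timesteps t."""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(8, 3, 32, 32)               # stand-in for a batch of images
t = torch.randint(0, T, (8,))                # a different random step per sample
xt = q_sample(x0, t, torch.randn_like(x0))   # noisy sample at arbitrary step t, in one shot
```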


3. The Noise Schedule

The schedule $\{\beta_t\}$ controls how quickly information is destroyed. Two common choices:

Linear Schedule (DDPM)

$$\beta_t = \beta_\text{start} + \frac{t-1}{T-1}\left(\beta_\text{end} - \beta_\text{start}\right), \qquad \beta_\text{start} = 10^{-4},\; \beta_\text{end} = 0.02$$

Simple but suboptimal — it destroys signal too quickly at the start, wasting many steps on nearly-pure noise.

Cosine Schedule (Improved DDPM)

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$$

Keeps $\bar{\alpha}_t$ near 1 for longer, then drops more smoothly — the network spends more steps in the perceptually meaningful regime and fewer steps on nearly-pure noise.
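
The two schedules are easy to compare numerically. The sketch below implements the formulas above (with the commonly used offset s = 0.008 for the cosine schedule); the function names are illustrative.

```python
# Compare ᾱ_t under the linear and cosine schedules.
import math
import torch

def linear_alpha_bar(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)

def cosine_alpha_bar(T: int, s: float = 0.008) -> torch.Tensor:
    t = torch.arange(0, T + 1, dtype=torch.float32)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return f[1:] / f[0]                      # ᾱ_t = f(t) / f(0)

for name, ab in [("linear", linear_alpha_bar(1000)), ("cosine", cosine_alpha_bar(1000))]:
    print(name, ab[[0, 249, 499, 749, 999]])  # ᾱ at t ≈ 1, 250, 500, 750, 1000
```

Printing a few values shows the linear schedule collapsing ᾱ_t toward zero much earlier, which is exactly the "wasted steps on nearly-pure noise" problem the cosine schedule addresses.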

[Interactive figure: schedule analysis, plotting signal survival √ᾱ_t and cumulative noise √(1−ᾱ_t) across timesteps for both schedules.]

4. The Reverse Process

The forward process is fixed and known. The reverse is what we learn.

The true reverse — the distribution of $\mathbf{x}_{t-1}$ given $\mathbf{x}_t$ — is intractable because it requires knowing $p(\mathbf{x})$. However, if the forward steps are small enough, the reverse is approximately Gaussian:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \sigma_t^2 \mathbf{I}\right)$$

We parameterise the mean $\boldsymbol{\mu}_\theta$ with a neural network. Generation then proceeds as:

$$\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

starting from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iterating down to $\mathbf{x}_0$.

The Posterior Given $\mathbf{x}_0$

The key mathematical fact: when conditioned on the clean image $\mathbf{x}_0$, the reverse distribution is tractable and Gaussian:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\; \tilde{\beta}_t \mathbf{I}\right)$$

$$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$$

This posterior will be the target for our training objective.

[Interactive figure: reverse denoising with SNR tracking over timesteps; the network predicts ε at each step, which is subtracted to recover the structured signal.]


5. The Training Objective

From ELBO to a Simple Loss

Like VAEs, the training objective is the ELBO on the log-likelihood $\log p_\theta(\mathbf{x}_0)$. After expanding through the Markov chain and simplifying, the ELBO decomposes into:

$$\mathcal{L} = \mathcal{L}_T + \sum_{t=2}^{T} \mathcal{L}_{t-1} + \mathcal{L}_0$$

where each $\mathcal{L}_{t-1}$ is a KL divergence between the true reverse posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and the learned $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$.

Since both are Gaussian, the KL has a closed form — it reduces to the squared difference between their means. Substituting the closed-form expression for $\tilde{\boldsymbol{\mu}}_t$ and reparameterising in terms of the noise $\boldsymbol{\epsilon}$, Ho et al. showed that a simplified loss works better in practice:

$$\boxed{\mathcal{L}_\text{simple} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right) \right\|^2\right]}$$

In plain English: sample a clean image, a random noise vector, and a random timestep $t$; corrupt the image using the closed-form forward equation; then train the network to predict the noise that was added.

This is just a mean-squared error regression. No adversarial training, no latent variable inference — just supervised regression on noise prediction.
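
A minimal training step looks like the sketch below. It assumes `model` is any ε-predictor taking `(x_t, t)` (for example a U-Net, not defined here) and reuses the `alpha_bar` tensor from the earlier schedule sketch; the names are illustrative.

```python
# One training step for the simplified DDPM loss: corrupt x_0 with the closed-form
# forward sample, then regress the network onto the noise that was added.
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                        # ε ~ N(0, I)
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps               # closed-form forward sample
    eps_hat = model(xt, t)                                            # network predicts the noise
    return F.mse_loss(eps_hat, eps)                                   # L_simple

# Usage: loss = ddpm_loss(model, batch, alpha_bar); loss.backward(); optimizer.step()
```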

Three Equivalent Prediction Targets

A diffusion network can equivalently predict:

| Target | Formula | Notes |
| --- | --- | --- |
| Noise $\boldsymbol{\epsilon}$ | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | Ho et al. (DDPM) — default |
| Clean image $\mathbf{x}_0$ | $\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}}$ | Derived from noise prediction |
| Score $\nabla_{\mathbf{x}} \log p(\mathbf{x}_t)$ | $-\frac{\boldsymbol{\epsilon}_\theta}{\sqrt{1-\bar{\alpha}_t}}$ | Song et al. — score-matching view |

All three are mathematically equivalent and are related by simple linear transforms.
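
The conversions in the table are one-liners. The sketch below writes them out explicitly for a single timestep; the helper names are illustrative.

```python
# Linear relations between the three prediction targets at step t.
import torch

def eps_to_x0(xt: torch.Tensor, eps: torch.Tensor, a_bar_t: torch.Tensor) -> torch.Tensor:
    """x̂_0 = (x_t - sqrt(1 - ᾱ_t)·ε) / sqrt(ᾱ_t)."""
    return (xt - (1.0 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()

def eps_to_score(eps: torch.Tensor, a_bar_t: torch.Tensor) -> torch.Tensor:
    """∇_x log p(x_t) ≈ -ε / sqrt(1 - ᾱ_t)."""
    return -eps / (1.0 - a_bar_t).sqrt()
```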


6. The Denoising Network: U-Net

The network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ takes a noisy image and a timestep as input and predicts the noise. The architecture of choice is the U-Net, originally designed for biomedical image segmentation.

Why U-Net?

A good noise predictor needs two properties:

  1. Global context — knowing the overall structure of the image to predict coherent noise.
  2. Fine-grained spatial output — the output is the same size as the input (pixel-level noise map).

U-Net satisfies both via an encoder–decoder structure with skip connections:

Input x_t

[Conv] → [Downsample] → [Conv] → [Downsample] → ... → [Bottleneck]

[Conv] ← [Upsample]  ← [Conv] ← [Upsample]  ← ... ←────────┘
    │         ↑               ↑
    └── skip ─┘               └── skip ────────┘

Output ε̂ (same shape as x_t)

Skip connections concatenate encoder feature maps directly into the decoder, preserving fine spatial detail that would otherwise be lost through downsampling.

Timestep Conditioning

The timestep $t$ is injected into every residual block via a sinusoidal embedding (identical in form to Transformer positional embeddings):

$$\text{emb}(t)_{2i} = \sin\!\left(t / 10000^{2i/d}\right), \qquad \text{emb}(t)_{2i+1} = \cos\!\left(t / 10000^{2i/d}\right)$$

This embedding is projected through a small MLP and added (or concatenated) to the feature maps at each resolution, telling the network which noise level it is operating at.
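
A short sketch of the embedding itself (the MLP projection is omitted). Note that $2i/d = i/\text{half}$ when $d$ is split into sine and cosine halves; the function name is illustrative.

```python
# Sinusoidal timestep embedding, mirroring the formula above.
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps of shape (B,) to embeddings of shape (B, dim), dim even."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                       # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)     # (B, dim)

emb = timestep_embedding(torch.tensor([1, 250, 999]), dim=128)       # three example timesteps
```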

Attention in U-Net

Modern diffusion U-Nets insert self-attention layers at intermediate resolutions (e.g., 16×16 and 8×8 feature maps). This allows the network to reason about global structure — e.g., ensuring both eyes of a face are consistent — without the full quadratic cost of attending over all pixels.

[Interactive U-Net diagram: Down 1 → Down 2 → Down 3 → Middle (Attention) → Up 3 → Up 2 → Up 1]

  • Skip connections: restore spatial high-frequencies lost during downsampling.
  • Timestep embedding: injected into every ResBlock to scale activations based on noise level.
  • Self-attention: operates at 16×16 and 8×8 resolutions to capture global coherence.


7. DDPM Sampling

The full DDPM sampling algorithm, starting from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t\,\mathbf{z}$$

where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (or $\mathbf{z} = \mathbf{0}$ at $t=1$), and $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$.

Breaking down the update:

  1. Predict and remove noise: $\frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta\right)$ — a "cleaned" estimate.
  2. Re-add stochastic noise: $\sigma_t \mathbf{z}$ — ensures the sample stays on the correct marginal distribution.

With $T = 1000$ steps, DDPM produces excellent samples but is slow — a single image requires 1000 forward passes through the U-Net.
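
Putting the update rule into a loop gives the whole sampler. The sketch below assumes the `model`, `betas`, `alphas`, and `alpha_bar` names from the earlier sketches (0-indexed timesteps); it is illustrative rather than a reference implementation.

```python
# Minimal DDPM sampling loop implementing the update above.
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, alphas, alpha_bar, device="cpu"):
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    T = betas.shape[0]
    for t in reversed(range(T)):                                 # t = T-1, ..., 0
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                                  # predict the noise
        coef = betas[t] / (1.0 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()               # "cleaned" estimate
        if t > 0:
            var = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]  # β̃_t
            x = mean + var.sqrt() * torch.randn_like(x)          # re-add stochastic noise
        else:
            x = mean                                             # no noise at the final step
    return x
```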


8. DDIM: Faster Deterministic Sampling

DDIM (Denoising Diffusion Implicit Models, Song et al., 2020) derives a non-Markovian reverse process that produces the same marginals as DDPM but can be sampled with far fewer steps.

The DDIM update rule, parameterised by $\eta \geq 0$:

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\underbrace{\left(\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}}\right)}_{\text{predicted } \mathbf{x}_0} + \underbrace{\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_\theta}_{\text{direction pointing to } \mathbf{x}_t} + \underbrace{\sigma_t\,\mathbf{z}}_{\text{noise}}$$

where $\sigma_t = \eta\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}}\sqrt{1-\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$.

| $\eta$ | Behaviour |
| --- | --- |
| $\eta = 1$ | Equivalent to DDPM — fully stochastic |
| $\eta = 0$ | Fully deterministic — the same $\mathbf{x}_T$ always gives the same $\mathbf{x}_0$ |

Setting $\eta = 0$ gives a deterministic ODE sampler. Because the trajectory is smooth and deterministic, you can skip large chunks of the timestep sequence — sampling at only 50 or even 20 steps instead of 1000, with minimal quality loss.
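
A sketch of a single DDIM update with η = 0: the step jumps from timestep `t` to an earlier `t_prev` that need not be `t - 1`, which is exactly what enables 50-step sampling. Names follow the earlier sketches and are illustrative.

```python
# One deterministic DDIM step (η = 0, so the σ_t noise term vanishes).
import torch

@torch.no_grad()
def ddim_step(model, x, t: int, t_prev: int, alpha_bar: torch.Tensor) -> torch.Tensor:
    t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
    eps = model(x, t_batch)
    a_t = alpha_bar[t]
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else torch.tensor(1.0)   # ᾱ → 1 before the first step
    x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()              # predicted x_0
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps       # "direction pointing to x_t"

# A full sampler applies ddim_step over, say, 50 evenly spaced timesteps instead of all 1000.
```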

The DDIM ODE

In the $\eta = 0$ limit, the reverse process is equivalent to solving an ODE defined by the neural network. This connects diffusion models to continuous normalizing flows — and means that any ODE solver (Euler, Heun, DPM-Solver) can be used for faster sampling.

[Interactive figure: deterministic DDIM trajectory from x_T to x₀ in 50 steps. Setting η = 0 converts the stochastic SDE into a deterministic ODE, allowing larger step sizes with minimal error; η = 1 recovers the original DDPM Markov chain.]

9. Score Matching: A Unified View

Parallel to the DDPM line of work, score-based generative models (Song & Ermon, 2019) approached the same idea differently — and the two frameworks turn out to be equivalent.

The Score Function

The score of a distribution $p(\mathbf{x})$ is the gradient of its log-density with respect to $\mathbf{x}$:

$$\mathbf{s}(\mathbf{x}) = \nabla_\mathbf{x} \log p(\mathbf{x})$$

The score points in the direction of increasing probability — it tells you which way to move a sample to make it more likely. Sampling can be done by following the score (Langevin dynamics):

$$\mathbf{x}_{i+1} = \mathbf{x}_i + \frac{\epsilon}{2}\nabla_{\mathbf{x}} \log p(\mathbf{x}_i) + \sqrt{\epsilon}\,\mathbf{z}$$
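
A toy sketch of Langevin dynamics, assuming a distribution whose score is known in closed form (a standard Gaussian, where ∇ log p(x) = −x). In a diffusion model the trained network supplies the score instead; the function names are illustrative.

```python
# Toy Langevin sampler: follow the score plus injected noise.
import torch

def langevin_sample(score_fn, x0: torch.Tensor, step: float = 0.01, n_steps: int = 500) -> torch.Tensor:
    x = x0.clone()
    for _ in range(n_steps):
        x = x + 0.5 * step * score_fn(x) + (step ** 0.5) * torch.randn_like(x)
    return x

# Starting far from the mode, samples drift toward N(0, I) because the score points uphill.
samples = langevin_sample(lambda x: -x, torch.randn(1000, 2) * 5.0)
```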

Score Matching

We cannot compute $\nabla_\mathbf{x} \log p(\mathbf{x})$ directly (we don't know $p$). But denoising score matching shows that a network trained to predict the noise is implicitly estimating the score:

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \approx -\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$$

Noise prediction = (scaled) score estimation. The DDPM training loss is a weighted sum of score matching objectives at all noise levels simultaneously.

[Interactive figure: the score field ∇ log p(x) points toward high-density regions; as the noise level σ increases, the field becomes smoother, providing gradients even in low-density "dead zones". ε_θ(x, t) ≈ −√(1−ᾱ_t) ∇ log p_t(x).]

10. Conditional Generation

Unconditional diffusion generates random samples from the training distribution. Conditional generation produces samples matching a given condition $\mathbf{c}$ (a text prompt, a class label, a reference image).

Classifier Guidance

Train the diffusion model unconditionally, plus a separate noise-aware classifier $p_\phi(y \mid \mathbf{x}_t)$ that classifies noisy images at any timestep. At inference, modify the score with the classifier gradient:

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, y) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \gamma\,\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t} \log p_\phi(y \mid \mathbf{x}_t)$$

The scalar $\gamma$ is the guidance scale — higher values push samples more strongly toward the condition.

Classifier-Free Guidance (CFG)

Classifier guidance requires training a separate classifier. Classifier-Free Guidance (Ho & Salimans, 2021) eliminates this by training a single model for both conditional and unconditional generation using random conditioning dropout:

  • During training, randomly drop the condition $\mathbf{c}$ (replace it with a null token $\emptyset$) with probability $p_\text{uncond}$.
  • The model learns both $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c})$ and $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$.

At inference, the two predictions are combined:

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, \mathbf{c}) = \underbrace{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)}_{\text{unconditional}} + \gamma\,\underbrace{\Big(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)\Big)}_{\text{conditional direction}}$$

The guidance scale $\gamma$ controls the creativity vs. adherence trade-off:

| $\gamma$ | Effect |
| --- | --- |
| $\gamma = 1$ | No guidance — purely conditional model |
| $\gamma = 7$–$15$ | Standard use — diverse yet prompt-adherent |
| $\gamma > 20$ | Prompt-adherent but oversaturated, less diverse |

CFG is used in virtually every text-to-image model today — Stable Diffusion, DALL-E 2, Imagen, and Flux all rely on it.
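
At inference, CFG is just two forward passes and a linear combination. The sketch below assumes a model interface with a conditioning argument and a null embedding; that interface and the names are assumptions, not a specific library's API.

```python
# Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction.
import torch

@torch.no_grad()
def cfg_eps(model, xt, t, cond, null_cond, guidance_scale: float = 7.5):
    eps_uncond = model(xt, t, null_cond)     # ε_θ(x_t, t, ∅)
    eps_cond = model(xt, t, cond)            # ε_θ(x_t, t, c)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In practice the two predictions are usually computed in a single batched forward pass by stacking the conditional and unconditional inputs.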

[Interactive figure: ε̃ = ε_∅ + γ·(ε_c − ε_∅) with guidance scale γ = 7.5. At high γ, colours become too intense and diversity drops (over-saturation); at low γ, the model is more "artistic" but follows prompts loosely.]


11. Latent Diffusion Models

Running diffusion in pixel space is expensive — a 512×512 RGB image has ~786K dimensions. Latent Diffusion Models (LDM, Rombach et al., 2022 — the basis of Stable Diffusion) solve this by running diffusion in a compressed latent space.

Two-Stage Architecture

Stage 1 — Train a VQ-regularised Autoencoder:

A convolutional encoder $\mathcal{E}$ compresses the image into a latent representation $\mathbf{z} = \mathcal{E}(\mathbf{x})$ — typically $64\times64\times4$ for a $512\times512$ input (a $48\times$ compression). A decoder $\mathcal{D}$ reconstructs the image: $\tilde{\mathbf{x}} = \mathcal{D}(\mathbf{z})$. The autoencoder is trained with a perceptual loss and a patch-based GAN discriminator.

Stage 2 — Train a Diffusion Model in Latent Space:

The diffusion model $\boldsymbol{\epsilon}_\theta$ operates entirely on the compressed latent $\mathbf{z}$, not the full image. Sampling proceeds as follows (see the sketch after this list):

  1. Sample $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ in the latent space.
  2. Denoise $\mathbf{z}_T \to \mathbf{z}_0$ using the diffusion model (DDIM, ~50 steps).
  3. Decode: $\mathbf{x}_0 = \mathcal{D}(\mathbf{z}_0)$.
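
A sketch of the end-to-end pipeline. Here `unet`, `vae_decode`, `text_encoder`, and `ddim_sample` are stand-ins for the Stage-2 denoiser, the Stage-1 decoder, the frozen text encoder, and a latent-space DDIM loop; the names and shapes are illustrative, not a specific library's API.

```python
# Latent diffusion sampling: all denoising happens on a 64×64×4 latent, not on pixels.
import torch

@torch.no_grad()
def generate(prompt: str, unet, vae_decode, text_encoder, ddim_sample, steps: int = 50):
    cond = text_encoder(prompt)                       # token embeddings τ(y)
    z_T = torch.randn(1, 4, 64, 64)                   # latent noise, not pixel noise
    z_0 = ddim_sample(unet, z_T, cond, steps=steps)   # denoise entirely in latent space
    return vae_decode(z_0)                            # decode the latent to a 512×512 image
```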

Cross-Attention for Text Conditioning

Text prompts are encoded by a frozen CLIP or T5 text encoder into a sequence of token embeddings $\boldsymbol{\tau}_\theta(\mathbf{y})$. These are injected into the U-Net via cross-attention at every resolution:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

$$Q = W_Q\,\mathbf{h}_t, \qquad K = W_K\,\boldsymbol{\tau}_\theta(\mathbf{y}), \qquad V = W_V\,\boldsymbol{\tau}_\theta(\mathbf{y})$$

The latent features $\mathbf{h}_t$ attend over the text token embeddings, allowing every spatial location to selectively read from relevant parts of the prompt.
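
A minimal cross-attention module capturing this pattern: queries come from the flattened spatial latent features, keys and values from the text tokens. Shapes, class name, and dimensions are illustrative.

```python
# Cross-attention: image features query the text token embeddings.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim: int, text_dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)        # Q from spatial latent features
        self.to_k = nn.Linear(text_dim, dim, bias=False)   # K from text embeddings
        self.to_v = nn.Linear(text_dim, dim, bias=False)   # V from text embeddings

    def forward(self, h: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # h: (B, HW, dim) flattened spatial features; text: (B, L, text_dim) token embeddings
        q, k, v = self.to_q(h), self.to_k(text), self.to_v(text)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)   # (B, HW, L)
        return attn @ v                                                      # each location reads from the prompt
```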

[Interactive figure: Stage 1, a VAE compresses a 512×512 image into a 64×64 latent grid (≈48× fewer dimensions); Stage 2, the U-Net denoises only the 64×64 latent, with text embeddings injected via cross-attention.]


12. Key Diffusion Model Families

| Model | Year | Key Contribution |
| --- | --- | --- |
| DDPM | 2020 | Simplified training — just predict noise (MSE) |
| Improved DDPM | 2021 | Cosine schedule; learned variance |
| DDIM | 2021 | Deterministic non-Markovian sampling; 10–50× faster |
| Score SDE | 2021 | Continuous-time SDE formulation; unified framework |
| Classifier Guidance | 2021 | Use a classifier gradient to steer generation |
| DALL-E 2 | 2022 | CLIP image embeddings as conditioning |
| Latent Diffusion / SD | 2022 | Diffusion in VAE latent space; open-source |
| Imagen | 2022 | Cascaded pixel diffusion + large T5 text encoder |
| DiT | 2022 | Transformer (not U-Net) as the denoising backbone |
| SDXL | 2023 | Larger LDM with two text encoders; multi-aspect training |
| Flow Matching | 2022–23 | Straight ODE trajectories; simpler training than diffusion |
| SD3 / Flux | 2024 | Multimodal Diffusion Transformer (MMDiT); flow matching |

13. The Diffusion Transformer (DiT)

While U-Net dominated the first generation of diffusion models, DiT (Peebles & Xie, 2022) replaced it with a pure Transformer backbone.

The image is split into patches, linearly projected into token embeddings, and processed by a standard Transformer with adaLN-Zero conditioning — the timestep and class label modulate the LayerNorm scale and shift parameters via a small MLP.
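
A simplified sketch of adaLN-Zero conditioning for a single sublayer: the conditioning embedding predicts a scale, shift, and gate for the normalised activations, with the gate initialised to zero so every block starts as the identity. The real DiT block applies this to both its attention and MLP branches; the class and argument names here are illustrative.

```python
# adaLN-Zero: conditioning modulates LayerNorm scale/shift and gates the residual update.
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.mod.weight)            # zero-init: gate starts at 0, block starts as identity
        nn.init.zeros_(self.mod.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor, sublayer) -> torch.Tensor:
        # x: (B, N, dim) patch tokens; cond: (B, cond_dim) timestep/class embedding
        shift, scale, gate = self.mod(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * sublayer(h)  # gated residual update (e.g. attention or MLP)
```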

DiT scales more predictably than U-Net (FID improves smoothly with compute) and forms the backbone of Stable Diffusion 3 and Flux.


14. Comparing Diffusion to Other Generative Models

| Property | Diffusion | GAN | VAE | Normalizing Flow | Autoregressive |
| --- | --- | --- | --- | --- | --- |
| Sample quality (images) | Best | High | Medium | Low–Med | Medium |
| Sampling speed | Slow (steps) | Fast | Fast | Fast | Slow (seq.) |
| Training stability | Excellent | Poor | Good | Good | Good |
| Likelihood | Bound (ELBO) | No | Bound | Exact | Exact |
| Mode coverage | Excellent | Often poor | Good | Good | Good |
| Conditional control | Excellent (CFG) | Moderate | Limited | Limited | Good |
| Latent space | Noisy $\mathbf{x}_T$ | Implicit | Structured | Exact | No |

Diffusion models trade sampling speed for everything else — they dominate on quality, diversity, and conditional control, at the cost of requiring hundreds to thousands of network evaluations per sample. Techniques like DDIM, DPM-Solver, consistency models, and flow matching progressively close this speed gap.

Consistency Models

A newer paradigm (Song et al., 2023) trains a model to map any point on the diffusion trajectory directly to x0\mathbf{x}_0, enabling single-step generation. Consistency distillation can produce near-DDPM quality in 1–4 steps, making real-time diffusion practical.


Summary

| Concept | Key Idea |
| --- | --- |
| Forward process | Fixed Markov chain: $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$ |
| Closed-form sample | $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ — skip to any step instantly |
| Reverse process | Learned Gaussian $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ — iterate from noise to data |
| Training objective | Predict the noise $\boldsymbol{\epsilon}$: $\mathcal{L} = \mathbb{E}[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2]$ |
| U-Net | Encoder–decoder with skip connections and timestep conditioning |
| DDIM | Deterministic ODE sampler — comparable quality in 50 steps instead of 1000 |
| Score matching | Noise prediction $\approx$ scaled score $\nabla \log p$ — two views of the same model |
| Classifier-free guidance | $\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\text{uncond} + \gamma(\boldsymbol{\epsilon}_\text{cond} - \boldsymbol{\epsilon}_\text{uncond})$ — fidelity vs. diversity knob |
| Latent diffusion | Run diffusion in a VAE latent space — 48× fewer dimensions, same quality |
| DiT | Transformer backbone replacing the U-Net — scales predictably with compute |
