
Diffusion Models

How gradually adding and then reversing noise teaches a neural network to generate data — from the forward Markov chain to classifier-free guidance and latent diffusion.


What if generating an image was nothing more than learning to clean up noise?

Diffusion models are built on exactly this idea. Rather than learning a generator from scratch, they learn the reverse of a simple, known process: gradually adding Gaussian noise to data until it becomes pure static. A neural network trained to undo this noise one small step at a time turns out to be an extraordinarily powerful generative model.

First proposed in a probabilistic framework by Sohl-Dickstein et al. (2015) and brought to state-of-the-art image quality by DDPM (Ho et al., 2020), diffusion models now power Stable Diffusion, DALL-E 2, Midjourney, Imagen, and Sora.


1. The Core Intuition

Imagine taking a photograph and gradually blurring and adding static to it — first a little, then more and more — until after enough steps it looks like pure random noise. This is the forward process: deterministic in distribution, easy to simulate, and completely destroys the information in the original image.

Now reverse it. If you could learn to take a noisy image and predict what a slightly less noisy version would look like, you could run that operation hundreds of times — starting from pure noise and ending at a realistic image. This is the reverse process, and it is what a diffusion model learns.

The key insight: the forward process is fixed and known. We do not learn it. We only learn the reverse. And because each denoising step is small, the reverse distribution at each step is approximately Gaussian — making it easy to parameterise with a neural network.

[Interactive figure: sliding t from 0 to T mixes the clean image x₀ with noise ε (x_t = 1.00·x₀ + 0.00·ε at t = 0), ending in pure noise x_T as entropy increases.]

2. The Forward Process

The forward process is a fixed Markov chain that adds a small amount of Gaussian noise at each of $T$ steps. Given a data point $\mathbf{x}_0 \sim q(\mathbf{x})$, successive noisy versions are sampled as:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\right)$$

At each step, the previous sample is slightly shrunk (by $\sqrt{1-\beta_t}$) and a small amount of noise (with variance $\beta_t$) is added. The sequence $\beta_1, \beta_2, \dots, \beta_T$ is the noise schedule — a fixed, pre-chosen set of small positive numbers.

The Closed-Form Forward Sample

The elegance of this Markov chain is that you can sample $\mathbf{x}_t$ at any arbitrary step $t$ directly from $\mathbf{x}_0$, without simulating all intermediate steps. Defining:

$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

the marginal at step $t$ is:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right)$$

This means any noisy sample can be written as a linear combination of the clean image and a standard Gaussian noise vector $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$\boxed{\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}}$$

| Term | Meaning |
| --- | --- |
| $\sqrt{\bar{\alpha}_t}$ | Signal coefficient — how much of the original image survives |
| $\sqrt{1-\bar{\alpha}_t}$ | Noise coefficient — how much noise has been added |
| $\bar{\alpha}_t \to 0$ as $t \to T$ | At the final step, the image is pure noise |

Why This Is So Useful

This closed-form means training is trivially parallelised: for any data point $\mathbf{x}_0$ and any step $t$, we can immediately compute a noisy sample $\mathbf{x}_t$ in one shot. No need to simulate the chain step by step.
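
As a minimal sketch, the one-shot forward sample can be written in a few lines. This assumes the linear DDPM schedule described below; the names (`T`, `betas`, `alpha_bar`, `q_sample`) are illustrative, not a specific library's API.

```python
# Closed-form forward sample: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # β_1 ... β_T (linear schedule)
alphas = 1.0 - betas                         # α_t = 1 - β_t
alpha_bar = torch.cumprod(alphas, dim=0)     # ᾱ_t = Π_{s≤t} α_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t directly from x_0 for a batch of (0-indexed) timesteps t."""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(8, 3, 32, 32)               # stand-in for a batch of images
t = torch.randint(0, T, (8,))                # a different random step per sample
xt = q_sample(x0, t, torch.randn_like(x0))   # noisy sample at arbitrary step t, in one shot
```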


3. The Noise Schedule

The schedule $\{\beta_t\}$ controls how quickly information is destroyed. Two common choices:

Linear Schedule (DDPM)

$$\beta_t = \beta_\text{start} + \frac{t-1}{T-1}\left(\beta_\text{end} - \beta_\text{start}\right), \qquad \beta_\text{start} = 10^{-4},\; \beta_\text{end} = 0.02$$

Simple but suboptimal — it destroys signal too quickly at the start, wasting many steps on nearly-pure noise.

Cosine Schedule (Improved DDPM)

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$$

Keeps $\bar{\alpha}_t$ near 1 for longer, then drops more smoothly — the network spends more steps in the perceptually meaningful regime and fewer steps on nearly-pure noise.
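
The two schedules are easy to compare numerically. The sketch below implements the formulas above (with the commonly used offset s = 0.008 for the cosine schedule); the function names are illustrative.

```python
# Compare ᾱ_t under the linear and cosine schedules.
import math
import torch

def linear_alpha_bar(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)

def cosine_alpha_bar(T: int, s: float = 0.008) -> torch.Tensor:
    t = torch.arange(0, T + 1, dtype=torch.float32)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return f[1:] / f[0]                      # ᾱ_t = f(t) / f(0)

for name, ab in [("linear", linear_alpha_bar(1000)), ("cosine", cosine_alpha_bar(1000))]:
    print(name, ab[[0, 249, 499, 749, 999]])  # ᾱ at t ≈ 1, 250, 500, 750, 1000
```

Printing a few values shows the linear schedule collapsing ᾱ_t toward zero much earlier, which is exactly the "wasted steps on nearly-pure noise" problem the cosine schedule addresses.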

[Interactive figure: schedule analysis, plotting signal survival √ᾱ_t and cumulative noise √(1−ᾱ_t) across timesteps for both schedules.]

4. The Reverse Process

The forward process is fixed and known. The reverse is what we learn.

The true reverse — the distribution of $\mathbf{x}_{t-1}$ given $\mathbf{x}_t$ — is intractable because it requires knowing $p(\mathbf{x})$. However, if the forward steps are small enough, the reverse is approximately Gaussian:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \sigma_t^2 \mathbf{I}\right)$$

We parameterise the mean $\boldsymbol{\mu}_\theta$ with a neural network. Generation then proceeds as:

$$\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

starting from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and iterating down to $\mathbf{x}_0$.

The Posterior Given $\mathbf{x}_0$

The key mathematical fact: when conditioned on the clean image $\mathbf{x}_0$, the reverse distribution is tractable and Gaussian:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\; \tilde{\beta}_t \mathbf{I}\right)$$

$$\tilde{\boldsymbol{\mu}}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$$

This posterior will be the target for our training objective.

[Interactive figure: reverse denoising with SNR tracking over timesteps; the network predicts ε at each step, which is subtracted to recover the structured signal.]


5. The Training Objective

From ELBO to a Simple Loss

Like VAEs, the training objective is the ELBO on the log-likelihood $\log p_\theta(\mathbf{x}_0)$. After expanding through the Markov chain and simplifying, the ELBO decomposes into:

$$\mathcal{L} = \mathcal{L}_T + \sum_{t=2}^{T} \mathcal{L}_{t-1} + \mathcal{L}_0$$

where each $\mathcal{L}_{t-1}$ is a KL divergence between the true reverse posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and the learned $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$.

Since both are Gaussian, the KL has a closed form — it reduces to the squared difference between their means. Substituting the closed-form expression for $\tilde{\boldsymbol{\mu}}_t$ and reparameterising in terms of the noise $\boldsymbol{\epsilon}$, Ho et al. showed that a simplified loss works better in practice:

$$\boxed{\mathcal{L}_\text{simple} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\!\left[\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\right) \right\|^2\right]}$$

In plain English: sample a clean image, a random noise vector, and a random timestep $t$; corrupt the image using the closed-form forward equation; then train the network to predict the noise that was added.

This is just a mean-squared error regression. No adversarial training, no latent variable inference — just supervised regression on noise prediction.
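
A minimal training step looks like the sketch below. It assumes `model` is any ε-predictor taking `(x_t, t)` (for example a U-Net, not defined here) and reuses the `alpha_bar` tensor from the earlier schedule sketch; the names are illustrative.

```python
# One training step for the simplified DDPM loss: corrupt x_0 with the closed-form
# forward sample, then regress the network onto the noise that was added.
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    B = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                        # ε ~ N(0, I)
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps               # closed-form forward sample
    eps_hat = model(xt, t)                                            # network predicts the noise
    return F.mse_loss(eps_hat, eps)                                   # L_simple

# Usage: loss = ddpm_loss(model, batch, alpha_bar); loss.backward(); optimizer.step()
```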

Three Equivalent Prediction Targets

A diffusion network can equivalently predict:

| Target | Formula | Notes |
| --- | --- | --- |
| Noise $\boldsymbol{\epsilon}$ | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | Ho et al. (DDPM) — default |
| Clean image $\mathbf{x}_0$ | $\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}}$ | Derived from noise prediction |
| Score $\nabla_{\mathbf{x}} \log p(\mathbf{x}_t)$ | $-\frac{\boldsymbol{\epsilon}_\theta}{\sqrt{1-\bar{\alpha}_t}}$ | Song et al. — score-matching view |

All three are mathematically equivalent and are related by simple linear transforms.
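
The conversions in the table are one-liners. The sketch below writes them out explicitly for a single timestep; the helper names are illustrative.

```python
# Linear relations between the three prediction targets at step t.
import torch

def eps_to_x0(xt: torch.Tensor, eps: torch.Tensor, a_bar_t: torch.Tensor) -> torch.Tensor:
    """x̂_0 = (x_t - sqrt(1 - ᾱ_t)·ε) / sqrt(ᾱ_t)."""
    return (xt - (1.0 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()

def eps_to_score(eps: torch.Tensor, a_bar_t: torch.Tensor) -> torch.Tensor:
    """∇_x log p(x_t) ≈ -ε / sqrt(1 - ᾱ_t)."""
    return -eps / (1.0 - a_bar_t).sqrt()
```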


6. The Denoising Network: U-Net

The network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ takes a noisy image and a timestep as input and predicts the noise. The architecture of choice is the U-Net, originally designed for biomedical image segmentation.

Why U-Net?

A good noise predictor needs two properties:

  1. Global context — knowing the overall structure of the image to predict coherent noise.
  2. Fine-grained spatial output — the output is the same size as the input (pixel-level noise map).

U-Net satisfies both via an encoder–decoder structure with skip connections:

Input x_t

[Conv] → [Downsample] → [Conv] → [Downsample] → ... → [Bottleneck]

[Conv] ← [Upsample]  ← [Conv] ← [Upsample]  ← ... ←────────┘
    │         ↑               ↑
    └── skip ─┘               └── skip ────────┘

Output ε̂ (same shape as x_t)

Skip connections concatenate encoder feature maps directly into the decoder, preserving fine spatial detail that would otherwise be lost through downsampling.

Timestep Conditioning

The timestep $t$ is injected into every residual block via a sinusoidal embedding (identical in form to Transformer positional embeddings):

$$\text{emb}(t)_{2i} = \sin\!\left(t / 10000^{2i/d}\right), \qquad \text{emb}(t)_{2i+1} = \cos\!\left(t / 10000^{2i/d}\right)$$

This embedding is projected through a small MLP and added (or concatenated) to the feature maps at each resolution, telling the network which noise level it is operating at.
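
A short sketch of the embedding itself (the MLP projection is omitted). Note that $2i/d = i/\text{half}$ when $d$ is split into sine and cosine halves; the function name is illustrative.

```python
# Sinusoidal timestep embedding, mirroring the formula above.
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps of shape (B,) to embeddings of shape (B, dim), dim even."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                       # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)     # (B, dim)

emb = timestep_embedding(torch.tensor([1, 250, 999]), dim=128)       # three example timesteps
```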

Attention in U-Net

Modern diffusion U-Nets insert self-attention layers at intermediate resolutions (e.g., 16×16 and 8×8 feature maps). This allows the network to reason about global structure — e.g., ensuring both eyes of a face are consistent — without the full quadratic cost of attending over all pixels.

[Interactive U-Net diagram: Down 1 → Down 2 → Down 3 → Middle (Attention) → Up 3 → Up 2 → Up 1]

  • Skip connections: restore spatial high-frequencies lost during downsampling.
  • Timestep embedding: injected into every ResBlock to scale activations based on noise level.
  • Self-attention: operates at 16×16 and 8×8 resolutions to capture global coherence.


7. DDPM Sampling

The full DDPM sampling algorithm, starting from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t\,\mathbf{z}$$

where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (or $\mathbf{z} = \mathbf{0}$ at $t=1$), and $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$.

Breaking down the update:

  1. Predict and remove noise: $\frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta\right)$ — a "cleaned" estimate.
  2. Re-add stochastic noise: $\sigma_t \mathbf{z}$ — ensures the sample stays on the correct marginal distribution.

With $T = 1000$ steps, DDPM produces excellent samples but is slow — a single image requires 1000 forward passes through the U-Net.
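
Putting the update rule into a loop gives the whole sampler. The sketch below assumes the `model`, `betas`, `alphas`, and `alpha_bar` names from the earlier sketches (0-indexed timesteps); it is illustrative rather than a reference implementation.

```python
# Minimal DDPM sampling loop implementing the update above.
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, alphas, alpha_bar, device="cpu"):
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    T = betas.shape[0]
    for t in reversed(range(T)):                                 # t = T-1, ..., 0
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                                  # predict the noise
        coef = betas[t] / (1.0 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()               # "cleaned" estimate
        if t > 0:
            var = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]  # β̃_t
            x = mean + var.sqrt() * torch.randn_like(x)          # re-add stochastic noise
        else:
            x = mean                                             # no noise at the final step
    return x
```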


8. DDIM: Faster Deterministic Sampling

DDIM (Denoising Diffusion Implicit Models, Song et al., 2020) derives a non-Markovian reverse process that produces the same marginals as DDPM but can be sampled with far fewer steps.

The DDIM update rule, parameterised by $\eta \geq 0$:

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\underbrace{\left(\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}}\right)}_{\text{predicted } \mathbf{x}_0} + \underbrace{\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_\theta}_{\text{direction pointing to } \mathbf{x}_t} + \underbrace{\sigma_t\,\mathbf{z}}_{\text{noise}}$$

where $\sigma_t = \eta\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}}\sqrt{1-\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$.

| $\eta$ | Behaviour |
| --- | --- |
| $\eta = 1$ | Equivalent to DDPM — fully stochastic |
| $\eta = 0$ | Fully deterministic — the same $\mathbf{x}_T$ always gives the same $\mathbf{x}_0$ |

Setting $\eta = 0$ gives a deterministic ODE sampler. Because the trajectory is smooth and deterministic, you can skip large chunks of the timestep sequence — sampling at only 50 or even 20 steps instead of 1000, with minimal quality loss.
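
A sketch of a single DDIM update with η = 0: the step jumps from timestep `t` to an earlier `t_prev` that need not be `t - 1`, which is exactly what enables 50-step sampling. Names follow the earlier sketches and are illustrative.

```python
# One deterministic DDIM step (η = 0, so the σ_t noise term vanishes).
import torch

@torch.no_grad()
def ddim_step(model, x, t: int, t_prev: int, alpha_bar: torch.Tensor) -> torch.Tensor:
    t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
    eps = model(x, t_batch)
    a_t = alpha_bar[t]
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else torch.tensor(1.0)   # ᾱ → 1 before the first step
    x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()              # predicted x_0
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps       # "direction pointing to x_t"

# A full sampler applies ddim_step over, say, 50 evenly spaced timesteps instead of all 1000.
```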

The DDIM ODE

In the $\eta = 0$ limit, the reverse process is equivalent to solving an ODE defined by the neural network. This connects diffusion models to continuous normalizing flows — and means that any ODE solver (Euler, Heun, DPM-Solver) can be used for faster sampling.

[Interactive figure: deterministic DDIM trajectory from x_T to x₀ in 50 steps. Setting η = 0 converts the stochastic SDE into a deterministic ODE, allowing larger step sizes with minimal error; η = 1 recovers the original DDPM Markov chain.]

9. Score Matching: A Unified View

Parallel to the DDPM line of work, score-based generative models (Song & Ermon, 2019) approached the same idea differently — and the two frameworks turn out to be equivalent.

The Score Function

The score of a distribution $p(\mathbf{x})$ is the gradient of its log-density with respect to $\mathbf{x}$:

$$\mathbf{s}(\mathbf{x}) = \nabla_\mathbf{x} \log p(\mathbf{x})$$

The score points in the direction of increasing probability — it tells you which way to move a sample to make it more likely. Sampling can be done by following the score (Langevin dynamics):

$$\mathbf{x}_{i+1} = \mathbf{x}_i + \frac{\epsilon}{2}\nabla_{\mathbf{x}} \log p(\mathbf{x}_i) + \sqrt{\epsilon}\,\mathbf{z}$$
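
A toy sketch of Langevin dynamics, assuming a distribution whose score is known in closed form (a standard Gaussian, where ∇ log p(x) = −x). In a diffusion model the trained network supplies the score instead; the function names are illustrative.

```python
# Toy Langevin sampler: follow the score plus injected noise.
import torch

def langevin_sample(score_fn, x0: torch.Tensor, step: float = 0.01, n_steps: int = 500) -> torch.Tensor:
    x = x0.clone()
    for _ in range(n_steps):
        x = x + 0.5 * step * score_fn(x) + (step ** 0.5) * torch.randn_like(x)
    return x

# Starting far from the mode, samples drift toward N(0, I) because the score points uphill.
samples = langevin_sample(lambda x: -x, torch.randn(1000, 2) * 5.0)
```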

Score Matching

We cannot compute $\nabla_\mathbf{x} \log p(\mathbf{x})$ directly (we don't know $p$). But denoising score matching shows that a network trained to predict the noise is implicitly estimating the score:

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \approx -\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$$

Noise prediction = (scaled) score estimation. The DDPM training loss is a weighted sum of score matching objectives at all noise levels simultaneously.

[Interactive figure: the score field ∇ log p(x) points toward high-density regions; as the noise level σ increases, the field becomes smoother, providing gradients even in low-density "dead zones". ε_θ(x, t) ≈ −√(1−ᾱ_t) ∇ log p_t(x).]

10. Conditional Generation

Unconditional diffusion generates random samples from the training distribution. Conditional generation produces samples matching a given condition $\mathbf{c}$ (a text prompt, a class label, a reference image).

Classifier Guidance

Train the diffusion model unconditionally, plus a separate noise-aware classifier $p_\phi(y \mid \mathbf{x}_t)$ that classifies noisy images at any timestep. At inference, modify the score with the classifier gradient:

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, y) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \gamma\,\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t} \log p_\phi(y \mid \mathbf{x}_t)$$

The scalar $\gamma$ is the guidance scale — higher values push samples more strongly toward the condition.

Classifier-Free Guidance (CFG)

Classifier guidance requires training a separate classifier. Classifier-Free Guidance (Ho & Salimans, 2021) eliminates this by training a single model for both conditional and unconditional generation using random conditioning dropout:

  • During training, randomly drop the condition $\mathbf{c}$ (replace it with a null token $\emptyset$) with probability $p_\text{uncond}$.
  • The model learns both $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c})$ and $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$.

At inference, the two predictions are combined:

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, \mathbf{c}) = \underbrace{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)}_{\text{unconditional}} + \gamma\,\underbrace{\Big(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \mathbf{c}) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)\Big)}_{\text{conditional direction}}$$

The guidance scale $\gamma$ controls the creativity vs. adherence trade-off:

| $\gamma$ | Effect |
| --- | --- |
| $\gamma = 1$ | No guidance — purely conditional model |
| $\gamma = 7$–$15$ | Standard use — diverse yet prompt-adherent |
| $\gamma > 20$ | Prompt-adherent but oversaturated, less diverse |

CFG is used in virtually every text-to-image model today — Stable Diffusion, DALL-E 2, Imagen, and Flux all rely on it.
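
At inference, CFG is just two forward passes and a linear combination. The sketch below assumes a model interface with a conditioning argument and a null embedding; that interface and the names are assumptions, not a specific library's API.

```python
# Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction.
import torch

@torch.no_grad()
def cfg_eps(model, xt, t, cond, null_cond, guidance_scale: float = 7.5):
    eps_uncond = model(xt, t, null_cond)     # ε_θ(x_t, t, ∅)
    eps_cond = model(xt, t, cond)            # ε_θ(x_t, t, c)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In practice the two predictions are usually computed in a single batched forward pass by stacking the conditional and unconditional inputs.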

[Interactive figure: ε̃ = ε_∅ + γ·(ε_c − ε_∅) with guidance scale γ = 7.5. At high γ, colours become too intense and diversity drops (over-saturation); at low γ, the model is more "artistic" but follows prompts loosely.]


11. Latent Diffusion Models

Running diffusion in pixel space is expensive — a 512×512 RGB image has ~786K dimensions. Latent Diffusion Models (LDM, Rombach et al., 2022 — the basis of Stable Diffusion) solve this by running diffusion in a compressed latent space.

Two-Stage Architecture

Stage 1 — Train a VQ-regularised Autoencoder:

A convolutional encoder $\mathcal{E}$ compresses the image into a latent representation $\mathbf{z} = \mathcal{E}(\mathbf{x})$ — typically $64\times64\times4$ for a $512\times512$ input (a $48\times$ compression). A decoder $\mathcal{D}$ reconstructs the image: $\tilde{\mathbf{x}} = \mathcal{D}(\mathbf{z})$. The autoencoder is trained with a perceptual loss and a patch-based GAN discriminator.

Stage 2 — Train a Diffusion Model in Latent Space:

The diffusion model $\boldsymbol{\epsilon}_\theta$ operates entirely on the compressed latent $\mathbf{z}$, not the full image. Sampling proceeds as follows (see the sketch after this list):

  1. Sample $\mathbf{z}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ in the latent space.
  2. Denoise $\mathbf{z}_T \to \mathbf{z}_0$ using the diffusion model (DDIM, ~50 steps).
  3. Decode: $\mathbf{x}_0 = \mathcal{D}(\mathbf{z}_0)$.
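
A sketch of the end-to-end pipeline. Here `unet`, `vae_decode`, `text_encoder`, and `ddim_sample` are stand-ins for the Stage-2 denoiser, the Stage-1 decoder, the frozen text encoder, and a latent-space DDIM loop; the names and shapes are illustrative, not a specific library's API.

```python
# Latent diffusion sampling: all denoising happens on a 64×64×4 latent, not on pixels.
import torch

@torch.no_grad()
def generate(prompt: str, unet, vae_decode, text_encoder, ddim_sample, steps: int = 50):
    cond = text_encoder(prompt)                       # token embeddings τ(y)
    z_T = torch.randn(1, 4, 64, 64)                   # latent noise, not pixel noise
    z_0 = ddim_sample(unet, z_T, cond, steps=steps)   # denoise entirely in latent space
    return vae_decode(z_0)                            # decode the latent to a 512×512 image
```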

Cross-Attention for Text Conditioning

Text prompts are encoded by a frozen CLIP or T5 text encoder into a sequence of token embeddings $\boldsymbol{\tau}_\theta(\mathbf{y})$. These are injected into the U-Net via cross-attention at every resolution:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

$$Q = W_Q\,\mathbf{h}_t, \qquad K = W_K\,\boldsymbol{\tau}_\theta(\mathbf{y}), \qquad V = W_V\,\boldsymbol{\tau}_\theta(\mathbf{y})$$

The latent features $\mathbf{h}_t$ attend over the text token embeddings, allowing every spatial location to selectively read from relevant parts of the prompt.
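
A minimal cross-attention module capturing this pattern: queries come from the flattened spatial latent features, keys and values from the text tokens. Shapes, class name, and dimensions are illustrative.

```python
# Cross-attention: image features query the text token embeddings.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim: int, text_dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)        # Q from spatial latent features
        self.to_k = nn.Linear(text_dim, dim, bias=False)   # K from text embeddings
        self.to_v = nn.Linear(text_dim, dim, bias=False)   # V from text embeddings

    def forward(self, h: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # h: (B, HW, dim) flattened spatial features; text: (B, L, text_dim) token embeddings
        q, k, v = self.to_q(h), self.to_k(text), self.to_v(text)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)   # (B, HW, L)
        return attn @ v                                                      # each location reads from the prompt
```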

[Interactive figure: Stage 1, a VAE compresses a 512×512 image into a 64×64 latent grid (≈48× fewer dimensions); Stage 2, the U-Net denoises only the 64×64 latent, with text embeddings injected via cross-attention.]


12. Key Diffusion Model Families

| Model | Year | Key Contribution |
| --- | --- | --- |
| DDPM | 2020 | Simplified training — just predict noise (MSE) |
| Improved DDPM | 2021 | Cosine schedule; learned variance |
| DDIM | 2021 | Deterministic non-Markovian sampling; 10–50× faster |
| Score SDE | 2021 | Continuous-time SDE formulation; unified framework |
| Classifier Guidance | 2021 | Use a classifier gradient to steer generation |
| DALL-E 2 | 2022 | CLIP image embeddings as conditioning |
| Latent Diffusion / SD | 2022 | Diffusion in VAE latent space; open-source |
| Imagen | 2022 | Cascaded pixel diffusion + large T5 text encoder |
| DiT | 2022 | Transformer (not U-Net) as the denoising backbone |
| SDXL | 2023 | Larger LDM with two text encoders; multi-aspect training |
| Flow Matching | 2022–23 | Straight ODE trajectories; simpler training than diffusion |
| SD3 / Flux | 2024 | Multimodal Diffusion Transformer (MMDiT); flow matching |

13. The Diffusion Transformer (DiT)

While U-Net dominated the first generation of diffusion models, DiT (Peebles & Xie, 2022) replaced it with a pure Transformer backbone.

The image is split into patches, linearly projected into token embeddings, and processed by a standard Transformer with adaLN-Zero conditioning — the timestep and class label modulate the LayerNorm scale and shift parameters via a small MLP.
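
A simplified sketch of adaLN-Zero conditioning for a single sublayer: the conditioning embedding predicts a scale, shift, and gate for the normalised activations, with the gate initialised to zero so every block starts as the identity. The real DiT block applies this to both its attention and MLP branches; the class and argument names here are illustrative.

```python
# adaLN-Zero: conditioning modulates LayerNorm scale/shift and gates the residual update.
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.mod = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.mod.weight)            # zero-init: gate starts at 0, block starts as identity
        nn.init.zeros_(self.mod.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor, sublayer) -> torch.Tensor:
        # x: (B, N, dim) patch tokens; cond: (B, cond_dim) timestep/class embedding
        shift, scale, gate = self.mod(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * sublayer(h)  # gated residual update (e.g. attention or MLP)
```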

DiT scales more predictably than U-Net (FID improves smoothly with compute) and forms the backbone of Stable Diffusion 3 and Flux.


14. Comparing Diffusion to Other Generative Models

| Property | Diffusion | GAN | VAE | Normalizing Flow | Autoregressive |
| --- | --- | --- | --- | --- | --- |
| Sample quality (images) | Best | High | Medium | Low–Med | Medium |
| Sampling speed | Slow (steps) | Fast | Fast | Fast | Slow (seq.) |
| Training stability | Excellent | Poor | Good | Good | Good |
| Likelihood | Bound (ELBO) | No | Bound | Exact | Exact |
| Mode coverage | Excellent | Often poor | Good | Good | Good |
| Conditional control | Excellent (CFG) | Moderate | Limited | Limited | Good |
| Latent space | Noisy $\mathbf{x}_T$ | Implicit | Structured | Exact | No |

Diffusion models trade sampling speed for everything else — they dominate on quality, diversity, and conditional control, at the cost of requiring hundreds to thousands of network evaluations per sample. Techniques like DDIM, DPM-Solver, consistency models, and flow matching progressively close this speed gap.

Consistency Models

A newer paradigm (Song et al., 2023) trains a model to map any point on the diffusion trajectory directly to x0\mathbf{x}_0, enabling single-step generation. Consistency distillation can produce near-DDPM quality in 1–4 steps, making real-time diffusion practical.


Summary

| Concept | Key Idea |
| --- | --- |
| Forward process | Fixed Markov chain: $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$ |
| Closed-form sample | $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ — skip to any step instantly |
| Reverse process | Learned Gaussian $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ — iterate from noise to data |
| Training objective | Predict the noise $\boldsymbol{\epsilon}$: $\mathcal{L} = \mathbb{E}[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2]$ |
| U-Net | Encoder–decoder with skip connections and timestep conditioning |
| DDIM | Deterministic ODE sampler — comparable quality in 50 steps instead of 1000 |
| Score matching | Noise prediction $\approx$ scaled score $\nabla \log p$ — two views of the same model |
| Classifier-free guidance | $\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\text{uncond} + \gamma(\boldsymbol{\epsilon}_\text{cond} - \boldsymbol{\epsilon}_\text{uncond})$ — fidelity vs. diversity knob |
| Latent diffusion | Run diffusion in a VAE latent space — 48× fewer dimensions, same quality |
| DiT | Transformer backbone replacing the U-Net — scales predictably with compute |
