
Variational Autoencoders

How VAEs learn structured latent spaces by combining neural networks with Bayesian inference — from the ELBO to the reparameterisation trick and beyond.


A Variational Autoencoder (VAE) is a generative model that learns a compressed, structured representation of data — called a latent space — and can generate new data by sampling from it. Introduced by Kingma & Welling (2013) and Rezende, Mohamed & Wierstra (2014), it remains one of the most influential ideas in deep generative modelling.

What makes a VAE different from a standard autoencoder is a single, powerful idea: instead of mapping each data point to a single latent vector, the encoder maps it to a probability distribution over the latent space. This forces the latent space to be continuous and structured — making it possible to generate new samples simply by sampling a latent code and decoding it.


1. Starting Point: The Standard Autoencoder

Before understanding what makes VAEs variational, it helps to understand a plain autoencoder.

An autoencoder has two parts:

  • Encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$: compresses input $\mathbf{x}$ into a low-dimensional code $\mathbf{z}$.
  • Decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$: reconstructs $\mathbf{x}$ from $\mathbf{z}$.

The training objective is simply reconstruction — minimise the difference between the input and its reconstruction.
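To make the contrast concrete, here is a minimal sketch of a plain autoencoder in PyTorch. The layer sizes and the 784-dimensional (MNIST-style) input are illustrative assumptions, not details from the text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A plain autoencoder: x -> z -> x_hat, trained purely on reconstruction."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),            # deterministic code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Training objective: plain reconstruction, with no constraint on the latent space.
model = Autoencoder()
x = torch.rand(16, 784)                      # a dummy minibatch
loss = nn.functional.mse_loss(model(x), x)   # minimise ||x - x_hat||^2
loss.backward()
```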

The Problem with Plain Autoencoders

A standard autoencoder maps each data point to a single point in latent space. The latent codes are scattered across $\mathbb{R}^d$ with no imposed structure. If you sample a random point $\mathbf{z}$ from the latent space and decode it, the decoder has no idea what to do — it was never trained to handle points between the learned codes.

You cannot use a plain autoencoder as a generative model.

[Figure: standard autoencoder vs. VAE. Discrete codes: gap = garbage. Continuous distributions: overlap = structure.]

2. The Generative View

A VAE starts from a probabilistic generative story:

  1. Sample a latent code from the prior: $\mathbf{z} \sim p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$
  2. Decode it into data via the likelihood: $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$

The model is parameterised by $\theta$ (the decoder network). We want to find $\theta$ that maximises the marginal likelihood of the data:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$

This integral is intractable — integrating over all possible latent codes is computationally infeasible for high-dimensional $\mathbf{z}$.

Why Is This Integral Hard?

For a 100-dimensional latent space, the integral is over $\mathbb{R}^{100}$. There is no closed form, and numerical quadrature is exponentially expensive in dimension. The VAE sidesteps this with approximate posterior inference.


3. Approximate Inference: The Encoder

The key insight of VAEs: instead of integrating over all $\mathbf{z}$, introduce a learned approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ — the encoder — that tries to match the true (intractable) posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$.

In practice, the encoder is a neural network that outputs the parameters of a diagonal Gaussian:

$$q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\left(\mathbf{z} \mid \boldsymbol{\mu}_\phi(\mathbf{x}),\, \text{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\right)$$

The encoder takes $\mathbf{x}$ and outputs two vectors: a mean $\boldsymbol{\mu}$ and a log-variance $\log \boldsymbol{\sigma}^2$ (the log is used for numerical stability), both of dimension $d_z$.
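A minimal sketch of such an encoder in PyTorch, with illustrative layer sizes; the two output heads produce $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}^2$.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder q_phi(z | x): outputs the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, input_dim: int = 784, hidden_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mu_phi(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log sigma^2_phi(x)

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)
```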

[Figure: the VAE pipeline: 1. Input (x) → 2. Encode (μ, σ) → 3. Sample (z) → 4. Decode (x̂).]

4. The Evidence Lower Bound (ELBO)

Since $\log p_\theta(\mathbf{x})$ is intractable, we derive a tractable lower bound using Jensen's inequality.

Starting from the log-likelihood:

$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$

Multiply and divide by $q_\phi(\mathbf{z} \mid \mathbf{x})$:

$$= \log \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \frac{p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right]$$

Apply Jensen's inequality ($\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$):

$$\geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \right] - D_\text{KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \right)$$

This lower bound is the ELBO (Evidence Lower BOund):

$$\boxed{\mathcal{L}(\theta, \phi;\, \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \right]}_{\text{Reconstruction term}} - \underbrace{D_\text{KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \right)}_{\text{Regularisation term}}}$$

Training maximises $\mathcal{L}$ with respect to both $\theta$ (decoder) and $\phi$ (encoder).
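As a concrete, hedged sketch, the negative ELBO for a Bernoulli decoder can be computed like this, using a single Monte Carlo sample of $\mathbf{z}$ for the reconstruction term and the closed-form Gaussian KL given in Section 5:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_logits, mu, logvar):
    """-ELBO for a Bernoulli decoder: reconstruction NLL + closed-form Gaussian KL.

    x          : (batch, d) targets in [0, 1]
    x_logits   : (batch, d) decoder outputs before the sigmoid
    mu, logvar : (batch, d_z) encoder outputs
    """
    # E_q[log p_theta(x | z)], estimated with the one sample of z used to produce x_logits
    recon_nll = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims and the batch
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return (recon_nll + kl) / x.shape[0]   # mean per data point
```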

The Gap

The ELBO is always a lower bound on $\log p_\theta(\mathbf{x})$. The gap between them is exactly the KL divergence between the approximate and true posterior:

$$\log p_\theta(\mathbf{x}) = \mathcal{L}(\theta, \phi;\, \mathbf{x}) + D_\text{KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}) \right)$$

The better the encoder approximates the true posterior, the tighter the bound.


5. Dissecting the Two ELBO Terms

Reconstruction Term

$$\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \right]$$

This measures how well the decoder reconstructs the input from the latent code. Concretely:

  • For binary data (e.g., binarised MNIST): $p_\theta(\mathbf{x} \mid \mathbf{z})$ is Bernoulli → reconstruction loss is binary cross-entropy.
  • For continuous data (e.g., natural images): $p_\theta(\mathbf{x} \mid \mathbf{z})$ is Gaussian → reconstruction loss is mean squared error.

This term pushes the model to encode and decode data faithfully.

Regularisation Term (KL Divergence)

$$D_\text{KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \right)$$

This measures how far the learned posterior is from the standard Gaussian prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$.

For diagonal Gaussians, this term has a closed-form solution that requires no Monte Carlo estimation:

$$D_\text{KL}\!\left( \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I}) \right) = \frac{1}{2} \sum_{j=1}^{d_z} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$

This term acts as a regulariser: it prevents the encoder from mapping all data points to extremely narrow, non-overlapping Gaussians (which would turn the VAE back into a plain autoencoder with no usable latent space).

[Interactive figure: loss tension in the β-VAE. Low β: prioritises detail, messy latent space. High β: prioritises structure, blurry reconstructions.]

6. The KL Divergence in Detail

It helps to build intuition for what $D_\text{KL}(q \| p)$ is doing geometrically.

$$D_\text{KL}(q \| p) = \mathbb{E}_q \left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z})} \right] \geq 0$$

It is zero if and only if $q = p$ everywhere. For our diagonal Gaussian case:

| What KL penalises | Why |
| --- | --- |
| Large $\mu_j$ (far from origin) | Pulls all posterior means toward 0 |
| Very small $\sigma_j$ (near-delta posterior) | Prevents the encoder from being over-confident |
| Very large $\sigma_j$ | Prevents the posterior from being too diffuse |

The KL term therefore enforces that the approximate posterior stays close to the prior — a standard Gaussian centred at the origin. This is what creates the smooth, connected latent space that enables generation.
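A quick numerical sanity check of these claims, using the closed-form expression above (the helper is just that formula; the numbers are illustrative):

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Per-example D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)

mu = torch.zeros(1, 8)
logvar = torch.zeros(1, 8)            # sigma = 1 everywhere
print(gaussian_kl(mu, logvar))        # tensor([0.]): matching the prior costs nothing

print(gaussian_kl(mu + 2.0, logvar))  # tensor([16.]): with sigma = 1 the cost is 0.5 * sum(mu^2)
```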


7. The Reparameterisation Trick

Here lies a critical technical challenge: the ELBO requires computing gradients with respect to $\phi$ (the encoder parameters), but the reconstruction term involves an expectation under $q_\phi(\mathbf{z} \mid \mathbf{x})$ — a distribution that itself depends on $\phi$.

Sampling is not differentiable. You cannot backpropagate through a sampling operation directly.

The Solution

Write the sample $\mathbf{z}$ as a deterministic function of $\phi$ and a noise variable $\boldsymbol{\epsilon}$ that is independent of $\phi$:

$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Now the randomness lives entirely in $\boldsymbol{\epsilon}$, which does not depend on $\phi$. The gradient $\frac{\partial \mathbf{z}}{\partial \phi}$ flows cleanly through $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}_\phi$.

$$\underbrace{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}_{\text{not differentiable}} \quad \Longrightarrow \quad \underbrace{\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}}_{\text{differentiable w.r.t. } \phi}$$
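In PyTorch the trick is a few lines; a sketch, assuming `mu` and `logvar` come from an encoder like the one sketched in Section 3:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(std)     # randomness isolated in eps, independent of phi
    return mu + std * eps

# Gradients flow from z back into mu and logvar:
mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
print(mu.grad.shape, logvar.grad.shape)   # both (4, 8): the path is differentiable
```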
[Figure: without reparameterisation, the gradient from the decoder (θ) is blocked at the sampling node; with reparameterisation, μ and σ are deterministic outputs of the encoder (φ) and the static noise ε enters separately, so gradients flow.]

8. The Full Training Pipeline

Putting it all together, one training step for a single data point $\mathbf{x}$:

  1. Encode — feed $\mathbf{x}$ through the encoder network to get $\boldsymbol{\mu}_\phi, \boldsymbol{\sigma}_\phi$.
  2. Sample — draw $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and compute $\mathbf{z} = \boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\epsilon}$.
  3. Decode — feed $\mathbf{z}$ through the decoder network to get the reconstruction $\hat{\mathbf{x}}$.
  4. Compute ELBO — evaluate reconstruction loss + KL divergence.
  5. Backpropagate — gradients flow through $\hat{\mathbf{x}}$, $\mathbf{z}$, $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, back into encoder and decoder weights.

In minibatch training, the ELBO for a dataset $\mathcal{D}$ is approximated as:

$$\mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \underbrace{\log p_\theta(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i)})}_{\text{reconstruction}} - \underbrace{D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}^{(i)}) \,\|\, p(\mathbf{z}))}_{\text{closed form}} \right]$$

where $\mathbf{z}^{(i)}$ is sampled once per data point per step using the reparameterisation trick.
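A hedged end-to-end sketch of one such training step, reusing the hypothetical `GaussianEncoder`, `reparameterize`, and `negative_elbo` helpers sketched earlier, plus an assumed decoder that outputs Bernoulli logits:

```python
import torch
import torch.nn as nn

# Assumed components (see earlier sketches); layer sizes are illustrative.
encoder = GaussianEncoder(input_dim=784, latent_dim=32)
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x: torch.Tensor) -> float:
    optimizer.zero_grad()
    mu, logvar = encoder(x)                           # 1. encode
    z = reparameterize(mu, logvar)                    # 2. sample via the reparameterisation trick
    x_logits = decoder(z)                             # 3. decode
    loss = negative_elbo(x, x_logits, mu, logvar)     # 4. -ELBO = reconstruction NLL + KL
    loss.backward()                                   # 5. gradients reach encoder and decoder
    optimizer.step()
    return loss.item()

x = torch.rand(64, 784)   # dummy minibatch
print(train_step(x))
```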


9. The Latent Space

The latent space is the heart of the VAE. Because the KL term regularises all posterior distributions toward the standard Gaussian prior, the latent space tends to have three desirable properties:

Continuity

Nearby points in latent space decode to similar outputs. Small perturbations to $\mathbf{z}$ produce small changes in $\hat{\mathbf{x}}$.

Completeness

Every point sampled from $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ decodes to a plausible output. The decoder is trained to handle the entire region near the origin, not just isolated point-codes.

Interpolation

Walking a straight line between two latent codes $\mathbf{z}_A$ and $\mathbf{z}_B$ produces a semantically meaningful interpolation between their corresponding data points.

$$\mathbf{z}(\lambda) = (1 - \lambda) \mathbf{z}_A + \lambda \mathbf{z}_B, \quad \lambda \in [0, 1]$$
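As an illustration, a linear latent interpolation might be implemented as follows (a sketch; `encoder` and `decoder` are the hypothetical networks from the training sketch, and the posterior means are used as endpoints):

```python
import torch

@torch.no_grad()
def interpolate(x_a: torch.Tensor, x_b: torch.Tensor, steps: int = 8) -> torch.Tensor:
    """Decode points on the straight line between the latent means of x_a and x_b."""
    z_a, _ = encoder(x_a)                             # posterior mean of x_a, shape (1, d_z)
    z_b, _ = encoder(x_b)                             # posterior mean of x_b
    lambdas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z_path = (1 - lambdas) * z_a + lambdas * z_b      # z(lambda), shape (steps, d_z)
    return torch.sigmoid(decoder(z_path))             # decoded frames along the walk
```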
[Interactive figure: a 2D latent space (z) alongside the decoded output (x̂); moving through the latent space morphs the sample.]

10. Generation

To generate a new sample:

  1. Sample a latent code from the prior: $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  2. Decode: $\hat{\mathbf{x}} = \text{decoder}_\theta(\mathbf{z})$

No encoder is needed at generation time. This is the key advantage over autoregressive models, where generation is sequential — a VAE decoder can generate an entire output in a single forward pass.
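Generation is then a two-line procedure; a sketch assuming the same hypothetical decoder and a 32-dimensional latent space:

```python
import torch

@torch.no_grad()
def generate(n_samples: int = 16, latent_dim: int = 32) -> torch.Tensor:
    z = torch.randn(n_samples, latent_dim)   # 1. z ~ N(0, I)
    return torch.sigmoid(decoder(z))         # 2. decode in a single forward pass
```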


11. Posterior Collapse

A notorious failure mode of VAEs is posterior collapse: the encoder ignores the input and maps every $\mathbf{x}$ to the prior, $q_\phi(\mathbf{z} \mid \mathbf{x}) \approx p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$.

When this happens:

  • The KL term drops to zero (the encoder is "perfect" from the regulariser's perspective).
  • The decoder learns to ignore $\mathbf{z}$ entirely and generates data from the marginal distribution.
  • The latent space carries no information about the input.

Why It Happens

The KL term always "wants" $q = p$ (zero KL is optimal). If the decoder is powerful enough (e.g., a Transformer or PixelCNN), it can achieve good reconstruction without any information from $\mathbf{z}$, making the encoder redundant.

Mitigations

| Technique | Mechanism |
| --- | --- |
| KL annealing | Start with β = 0 and slowly increase it to 1 during training |
| Free bits | Only penalise the KL in a dimension once it exceeds a threshold $\lambda$; KL below $\lambda$ is "free" |
| Weakened decoder | Deliberately limit decoder capacity so it must use $\mathbf{z}$ |
| δ-VAE | Ensure a minimum KL per latent dimension |
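KL annealing and free bits only change how the KL term enters the loss. A hedged sketch, with an arbitrary schedule length and threshold:

```python
import torch

def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """KL annealing: ramp beta linearly from 0 to 1 over the first warmup_steps."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Free bits: each latent dimension gets `lam` nats 'for free' before being penalised."""
    return torch.clamp(kl_per_dim, min=lam).sum(dim=-1)

# In the training step the loss would become something like:
#   loss = recon_nll + kl_weight(step) * free_bits_kl(kl_per_dim).sum()
```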
[Figure: latent encodings (the encoder separates inputs into distinct regions) and reconstructions (the decoder uses the latent code to reconstruct specific details).]


12. Beyond the Vanilla VAE

β-VAE: Disentanglement

$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - \beta \cdot D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})), \quad \beta > 1$$

A higher β forces the posterior to be even closer to the factorial prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$, encouraging each latent dimension to encode an independent, disentangled factor of variation (shape, colour, orientation, etc.).
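In code, the only change from the vanilla objective is a multiplier on the KL term; a sketch mirroring the hypothetical `negative_elbo` above:

```python
import torch
import torch.nn.functional as F

def beta_negative_elbo(x, x_logits, mu, logvar, beta: float = 4.0):
    """-L_beta: reconstruction NLL + beta * KL. beta = 1 recovers the vanilla VAE."""
    recon_nll = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return (recon_nll + beta * kl) / x.shape[0]
```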

Latent Traversals & Disentanglement

Varying $z_j$ from -3 to +3 while fixing other dimensions at 0.

[Interactive figure: traversals of three latent dimensions from -3 to +3 under adjustable KL pressure β, from entangled (β = 1, vanilla) to disentangled (high β).]
Entanglement (Low β)

Dimensions are mixed. Notice how changing "Dimension 1" might affect both size and colour. This makes the latent space harder to interpret.

Disentangled (High β)

Each dimension controls a single isolated feature. "Dimension 2" now only controls the hue, while shape and size remain constant.


Hierarchical VAEs

Standard VAEs use a single level of latent variables. Hierarchical VAEs (e.g., NVAE, VDVAE) stack multiple layers of stochastic variables:

$$p_\theta(\mathbf{x}, \mathbf{z}_1, \dots, \mathbf{z}_L) = p_\theta(\mathbf{x} \mid \mathbf{z}_1) \left[ \prod_{l=1}^{L-1} p_\theta(\mathbf{z}_l \mid \mathbf{z}_{l+1}) \right] p(\mathbf{z}_L)$$

Lower layers capture fine-grained local details; higher layers capture global structure. This hierarchy dramatically improves sample quality.


VQ-VAE: Discrete Latent Spaces

VQ-VAE (van den Oord et al., 2017) replaces the continuous Gaussian posterior with a discrete codebook. The encoder maps $\mathbf{x}$ to a sequence of indices into a learned embedding table; the decoder takes the corresponding embedding vectors.

The key insight: there is no KL term — the prior is a uniform categorical distribution. Instead, a separate autoregressive prior (e.g., a PixelCNN or Transformer) is trained on the sequence of codebook indices. This decouples representation learning from generation and produces sharper samples than a Gaussian decoder.
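A hedged sketch of the core codebook lookup (nearest-neighbour quantisation with a straight-through gradient; the codebook size, dimensions, and the omission of the codebook/commitment losses are simplifications):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Map each encoder vector to its nearest codebook entry (straight-through gradient)."""
    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, code_dim) continuous encoder outputs
        distances = torch.cdist(z_e, self.codebook.weight)   # (batch, num_codes)
        indices = distances.argmin(dim=-1)                    # discrete latent codes
        z_q = self.codebook(indices)                          # quantised vectors
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices
```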

VQ-VAE is the foundation of DALL-E 1, MusicLM, and many audio tokenisation systems.


13. VAEs vs. Other Generative Models

| Property | VAE | Normalizing Flow | Autoregressive | GAN | Diffusion |
| --- | --- | --- | --- | --- | --- |
| Exact log-likelihood | No (ELBO) | Yes | Yes | No | No (bound) |
| Fast sampling | Yes | Yes | No (sequential) | Yes | No (steps) |
| Stable training | Yes | Yes | Yes | No | Yes |
| Structured latent space | Yes | Yes (exact) | No | No | Implicit |
| Invertible encoder | No | Yes | N/A | No | No |
| Sample quality (images) | Medium | Low–Medium | Medium | High | Highest |

VAEs occupy a unique niche: they provide a learned, structured latent space with a meaningful encoder — something that flows, GANs, and diffusion models lack by default. This makes them the tool of choice when representation learning and inference (not just generation) matter.


Summary

| Concept | Key Idea |
| --- | --- |
| Generative story | $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$ |
| Intractable posterior | $p_\theta(\mathbf{z} \mid \mathbf{x})$ cannot be computed exactly — approximate with $q_\phi$ |
| ELBO | $\mathbb{E}_q[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_\text{KL}(q_\phi \| p)$ — a tractable lower bound |
| Reconstruction term | Cross-entropy or MSE — pushes for faithful decoding |
| KL term | Closed-form Gaussian expression — regularises the latent space |
| Reparameterisation | $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ — makes sampling differentiable |
| Posterior collapse | Encoder ignored; mitigated by KL annealing or free bits |
| β-VAE | Higher KL weight → disentangled latent dimensions |
| VQ-VAE | Discrete codebook + autoregressive prior → sharper samples |
