Variational Autoencoders
How VAEs learn structured latent spaces by combining neural networks with Bayesian inference — from the ELBO to the reparameterisation trick and beyond.
A Variational Autoencoder (VAE) is a generative model that learns a compressed, structured representation of data — called a latent space — and can generate new data by sampling from it. Introduced by Kingma & Welling (2013) and Rezende, Mohamed & Wierstra (2014), it remains one of the most influential ideas in deep generative modelling.
What makes a VAE different from a standard autoencoder is a single, powerful idea: instead of mapping each data point to a single latent vector, the encoder maps it to a probability distribution over the latent space. This forces the latent space to be continuous and structured — making it possible to generate new samples simply by sampling a latent code and decoding it.
1. Starting Point: The Standard Autoencoder
Before understanding what makes VAEs variational, it helps to understand a plain autoencoder.
An autoencoder has two parts:
- Encoder: compresses the input $x$ into a low-dimensional code $z$.
- Decoder: reconstructs $\hat{x}$ from $z$.
The training objective is simply reconstruction — minimise the difference between the input and its reconstruction.
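As a concrete reference point, here is a minimal sketch of a plain autoencoder. The use of PyTorch, the layer sizes, and the 784-dimensional input are illustrative assumptions, not choices made above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: input x -> low-dimensional code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: code z -> reconstruction x_hat
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # each input maps to a single point in latent space
        return self.decoder(z)

# The objective is pure reconstruction: minimise ||x - x_hat||^2.
model = Autoencoder()
x = torch.randn(16, 784)             # a dummy minibatch
loss = F.mse_loss(model(x), x)
loss.backward()
```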
The Problem with Plain Autoencoders
A standard autoencoder maps each data point to a single point in latent space. The latent codes are scattered across the latent space with no imposed structure. If you sample a random point from the latent space and decode it, the decoder has no idea what to do — it was never trained to handle points between the learned codes.
You cannot use a plain autoencoder as a generative model.
(Interactive figure: the latent space of a standard autoencoder vs. that of a VAE.)
2. The Generative View
A VAE starts from a probabilistic generative story:
- Sample a latent code from the prior: $z \sim p(z) = \mathcal{N}(0, I)$
- Decode it into data via the likelihood: $x \sim p_\theta(x \mid z)$

The model is parameterised by $\theta$ (the decoder network). We want to find $\theta$ that maximises the marginal likelihood of the data:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
This integral is intractable — integrating over all possible latent codes $z$ is computationally impossible when the latent space is high-dimensional.
Why Is This Integral Hard?
For a 100-dimensional latent space, the integral is over all of $\mathbb{R}^{100}$. There is no closed form, and numerical quadrature is exponentially expensive in dimension. The VAE sidesteps this with approximate posterior inference.
3. Approximate Inference: The Encoder
The key insight of VAEs: instead of integrating over all $z$, introduce a learned approximate posterior $q_\phi(z \mid x)$ — the encoder — that tries to match the true (intractable) posterior $p_\theta(z \mid x)$.
In practice, the encoder is a neural network that outputs the parameters of a diagonal Gaussian:

$$q_\phi(z \mid x) = \mathcal{N}\!\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big)$$

The encoder takes $x$ and outputs two vectors: a mean $\mu_\phi(x)$ and a log-variance $\log \sigma_\phi^2(x)$ (the log is used for numerical stability), both of dimension $d$, the size of the latent space.
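A sketch of such an encoder follows; the architecture, hidden width, and latent dimension are illustrative assumptions. The network shares a hidden layer and has two output heads, one for the mean and one for the log-variance:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps x to the parameters (mu, log sigma^2) of a diagonal Gaussian q(z|x)."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)  # log-variance, for numerical stability

    def forward(self, x):
        h = self.hidden(x)
        return self.mu_head(h), self.logvar_head(h)

mu, logvar = GaussianEncoder()(torch.randn(16, 784))
print(mu.shape, logvar.shape)   # both (16, 32): one d-dimensional vector each
```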
4. The Evidence Lower Bound (ELBO)
Since $\log p_\theta(x)$ is intractable, we derive a tractable lower bound using Jensen's inequality.

Starting from the log-likelihood:

$$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p(z)\, dz$$

Multiply and divide by $q_\phi(z \mid x)$:

$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$

Apply Jensen's inequality ($\log \mathbb{E}[\cdot] \geq \mathbb{E}[\log(\cdot)]$):

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]$$

This lower bound is the ELBO (Evidence Lower BOund):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\Vert\, p(z)\big)$$
Training maximises $\mathcal{L}(\theta, \phi; x)$ with respect to both $\theta$ (decoder) and $\phi$ (encoder).
The Gap
The ELBO is always a lower bound on $\log p_\theta(x)$. The gap between them is exactly the KL divergence between the approximate and true posterior:

$$\log p_\theta(x) - \mathcal{L}(\theta, \phi; x) = D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\Vert\, p_\theta(z \mid x)\big)$$
The better the encoder approximates the true posterior, the tighter the bound.
5. Dissecting the Two ELBO Terms
Reconstruction Term

$$\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]$$

This measures how well the decoder reconstructs the input from the latent code. Concretely:
- For binary data (e.g., binarised MNIST): $p_\theta(x \mid z)$ is Bernoulli → the reconstruction loss is binary cross-entropy.
- For continuous data (e.g., natural images): $p_\theta(x \mid z)$ is Gaussian → the reconstruction loss is mean squared error.
This term pushes the model to encode and decode data faithfully.
Regularisation Term (KL Divergence)
This measures how far the learned posterior $q_\phi(z \mid x)$ is from the standard Gaussian prior $p(z) = \mathcal{N}(0, I)$.

For diagonal Gaussians, this term has a closed-form solution that requires no Monte Carlo estimation:

$$D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\Vert\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
This term acts as a regulariser: it prevents the encoder from mapping all data points to extremely narrow, non-overlapping Gaussians (which would turn the VAE back into a plain autoencoder with no usable latent space).
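In code, the closed-form expression is a one-liner over the encoder outputs. A minimal sketch, assuming `mu` and `logvar` hold the per-dimension mean and log-variance produced by the encoder:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the latent dimensions.

    Implements 0.5 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1) per example.
    """
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)

mu = torch.zeros(4, 32)
logvar = torch.zeros(4, 32)                # sigma = 1 everywhere
print(kl_to_standard_normal(mu, logvar))   # exactly zero: q already equals the prior
```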
(Interactive figure: the tension between the reconstruction term and the KL term as the weight β varies.)
6. The KL Divergence in Detail
It helps to build intuition for what $D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\Vert\, p(z)\big)$ is doing geometrically.

It is zero if and only if $q_\phi(z \mid x) = p(z)$ everywhere. For our diagonal Gaussian case:
| What KL penalises | Why |
|---|---|
| Large $\lvert\mu_j\rvert$ (mean far from the origin) | Pulls all posterior means toward 0 |
| Very small $\sigma_j$ (near-delta posterior) | Prevents the encoder from being over-confident |
| Very large $\sigma_j$ | Prevents the posterior from being too diffuse |
The KL term therefore enforces that the approximate posterior stays close to the prior — a standard Gaussian centred at the origin. This is what creates the smooth, connected latent space that enables generation.
7. The Reparameterisation Trick
Here lies a critical technical challenge: the ELBO requires computing gradients with respect to $\phi$ (the encoder parameters), but the reconstruction term involves an expectation under $q_\phi(z \mid x)$ — a distribution that itself depends on $\phi$.
Sampling is not differentiable. You cannot backpropagate through a sampling operation directly.
The Solution
Write the sample $z$ as a deterministic function of $\phi$ and a noise variable $\epsilon$ that is independent of $\phi$:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

Now the randomness lives entirely in $\epsilon$, which does not depend on $\phi$. The gradient flows cleanly through $\mu_\phi(x)$ and $\sigma_\phi(x)$.
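A minimal sketch of the trick in PyTorch (the function name and shapes are illustrative):

```python
import torch

def reparameterise(mu, logvar):
    """Sample z ~ N(mu, diag(sigma^2)) as a differentiable function of (mu, logvar)."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(std)     # the noise is independent of the encoder parameters
    return mu + std * eps           # gradients flow through mu and std, not through eps

# Gradients reach mu and logvar even though z is random:
mu = torch.zeros(8, 32, requires_grad=True)
logvar = torch.zeros(8, 32, requires_grad=True)
z = reparameterise(mu, logvar)
z.sum().backward()
print(mu.grad.abs().sum() > 0)      # tensor(True)
```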
8. The Full Training Pipeline
Putting it all together, one training step for a single data point $x$:
- Encode — feed $x$ through the encoder network to get $\mu_\phi(x)$ and $\log \sigma_\phi^2(x)$.
- Sample — draw $\epsilon \sim \mathcal{N}(0, I)$ and compute $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$.
- Decode — feed $z$ through the decoder network to get the reconstruction $\hat{x}$.
- Compute ELBO — evaluate reconstruction loss + KL divergence.
- Backpropagate — gradients flow through $\hat{x}$, $z$, $\mu_\phi(x)$, and $\sigma_\phi(x)$, back into the encoder and decoder weights.
In minibatch training, the ELBO for a dataset is approximated over a minibatch $\mathcal{B}$ as:

$$\mathcal{L}(\theta, \phi) \approx \frac{1}{\lvert\mathcal{B}\rvert} \sum_{x_i \in \mathcal{B}} \Big[ \log p_\theta(x_i \mid z_i) - D_{\mathrm{KL}}\!\big(q_\phi(z \mid x_i) \,\Vert\, p(z)\big) \Big]$$

where $z_i$ is sampled once per data point per step using the reparameterisation trick.
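A sketch of one such training step on binarised data, with the network shapes, optimiser, and hyperparameters chosen purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 32
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

def training_step(x):                                        # x: (batch, 784), values in [0, 1]
    mu, logvar = encoder(x).chunk(2, dim=-1)                 # 1. encode
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # 2. reparameterised sample
    logits = decoder(z)                                      # 3. decode
    recon = F.binary_cross_entropy_with_logits(              # 4a. reconstruction term (Bernoulli)
        logits, x, reduction="none").sum(dim=-1)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)  # 4b. closed-form KL
    loss = (recon + kl).mean()                               # negative ELBO over the minibatch
    opt.zero_grad()
    loss.backward()                                          # 5. backpropagate through z, mu, sigma
    opt.step()
    return loss.item()

training_step(torch.rand(16, 784))
```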
9. The Latent Space
The latent space is the heart of the VAE. Because the KL term regularises all posterior distributions toward the standard Gaussian prior, the latent space tends to have three desirable properties:
Continuity
Nearby points in latent space decode to similar outputs. Small perturbations to $z$ produce small changes in the decoded output $\hat{x}$.
Completeness
Every point sampled from the prior $\mathcal{N}(0, I)$ decodes to a plausible output. The decoder is trained to handle the entire region near the origin, not just isolated point-codes.
Interpolation
Walking a straight line between two latent codes $z_1$ and $z_2$ produces a semantically meaningful interpolation between their corresponding data points.
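A sketch of such an interpolation, assuming an `encoder`/`decoder` pair with the interfaces used in the training sketch above:

```python
import torch

@torch.no_grad()
def interpolate(x1, x2, encoder, decoder, steps=8):
    """Decode points along the straight line between the latent means of x1 and x2."""
    mu1, _ = encoder(x1).chunk(2, dim=-1)         # posterior means as the two endpoints
    mu2, _ = encoder(x2).chunk(2, dim=-1)
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z_path = (1 - alphas) * mu1 + alphas * mu2    # straight line in latent space
    return torch.sigmoid(decoder(z_path))         # one decoded output per step
```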
10. Generation
To generate a new sample:
- Sample a latent code from the prior: $z \sim \mathcal{N}(0, I)$
- Decode: $x \sim p_\theta(x \mid z)$
No encoder is needed at generation time. This is the key advantage over autoregressive models, where generation is sequential — a VAE decoder can generate an entire output in a single forward pass.
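Generation is correspondingly short. A sketch, again assuming the decoder interface from the training sketch above:

```python
import torch

@torch.no_grad()
def generate(decoder, n_samples=16, latent_dim=32):
    z = torch.randn(n_samples, latent_dim)   # 1. sample z ~ N(0, I) from the prior
    return torch.sigmoid(decoder(z))         # 2. decode everything in one forward pass
```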
11. Posterior Collapse
A notorious failure mode of VAEs is posterior collapse: the encoder ignores the input and maps every $x$ to the prior, $q_\phi(z \mid x) \approx \mathcal{N}(0, I)$.
When this happens:
- The KL term drops to zero (the encoder is "perfect" from the regulariser's perspective).
- The decoder learns to ignore $z$ entirely and generates data from the marginal distribution.
- The latent space carries no information about the input.
Why It Happens
The KL term always "wants" $q_\phi(z \mid x) = p(z)$ (zero KL is optimal). If the decoder is powerful enough (e.g., a Transformer or PixelCNN), it can achieve good reconstruction without any information from $z$, making the encoder redundant.
Mitigations
| Technique | Mechanism |
|---|---|
| KL annealing | Start with β = 0, slowly increase to 1 during training |
| Free bits | Only penalise KL when it drops below a threshold per dimension |
| Weakened decoder | Deliberately limit decoder capacity so it must use $z$ |
| δ-VAE | Ensure a minimum KL per latent dimension |
(Interactive figure: in a healthy VAE, the encoder separates inputs into distinct latent regions and the decoder uses the latent code to reconstruct input-specific details.)
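A sketch of how two of these mitigations can be wired into the loss; the linear warm-up schedule and the 0.5-nat threshold are illustrative choices, and the free-bits variant shown here clamps the KL per example and per dimension, which is a common simplification:

```python
import torch

def beta_schedule(step, warmup_steps=10_000):
    """KL annealing: ramp the KL weight linearly from 0 to 1 over a warm-up period."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(mu, logvar, free_bits=0.5):
    """Free bits: stop penalising a dimension once its KL falls below `free_bits` nats."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)   # (batch, latent_dim)
    return torch.clamp(kl_per_dim, min=free_bits).sum(dim=-1)

# In the training step:
#   loss = (recon + beta_schedule(step) * free_bits_kl(mu, logvar)).mean()
```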
12. Beyond the Vanilla VAE
β-VAE: Disentanglement
The β-VAE multiplies the KL term of the ELBO by a weight $\beta > 1$:

$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\Vert\, p(z)\big)$$

A higher β forces the posterior to be even closer to the factorised prior $p(z) = \prod_j \mathcal{N}(z_j; 0, 1)$, encouraging each latent dimension to encode an independent, disentangled factor of variation (shape, colour, orientation, etc.).
(Interactive figure: latent traversals and disentanglement. Varying $z_j$ from −3 to +3 while fixing the other dimensions at 0. In an entangled model the dimensions are mixed: changing "Dimension 1" might affect both size and colour, making the latent space harder to interpret. In a disentangled model each dimension controls a single isolated feature: "Dimension 2" controls only the hue, while shape and size remain constant.)
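A sketch of how such a traversal can be produced, assuming the decoder interface from the training sketch above:

```python
import torch

@torch.no_grad()
def traverse_dimension(decoder, dim, latent_dim=32, steps=7, lo=-3.0, hi=3.0):
    """Vary one latent dimension from lo to hi while holding all others at 0."""
    z = torch.zeros(steps, latent_dim)
    z[:, dim] = torch.linspace(lo, hi, steps)
    return torch.sigmoid(decoder(z))   # one decoded output per traversal step
```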
Hierarchical VAEs
Standard VAEs use a single level of latent variables. Hierarchical VAEs (e.g., NVAE, VDVAE) stack multiple layers of stochastic variables:

$$p_\theta(x, z_1, \dots, z_L) = p(z_L) \left[\prod_{l=1}^{L-1} p_\theta(z_l \mid z_{l+1})\right] p_\theta(x \mid z_1)$$
Lower layers capture fine-grained local details; higher layers capture global structure. This hierarchy dramatically improves sample quality.
VQ-VAE: Discrete Latent Spaces
VQ-VAE (van den Oord et al., 2017) replaces the continuous Gaussian posterior with a discrete codebook. The encoder maps $x$ to a sequence of indices into a learned embedding table; the decoder takes the corresponding embedding vectors.
The key insight: there is no KL term — the prior is a uniform categorical distribution. Instead, a separate autoregressive prior (e.g., a PixelCNN or Transformer) is trained on the sequence of codebook indices. This decouples representation learning from generation and produces sharper samples than a Gaussian decoder.
VQ-VAE is the foundation of DALL-E 1, MusicLM, and many audio tokenisation systems.
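A sketch of the nearest-neighbour quantisation step at the heart of VQ-VAE, with the straight-through gradient estimator and the codebook/commitment losses; the codebook size, embedding dimension, and commitment weight are illustrative, and the surrounding encoder/decoder are omitted:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # learned embedding table
        self.beta = beta                                   # commitment loss weight

    def forward(self, z_e):                                # z_e: (batch, code_dim) encoder outputs
        dists = torch.cdist(z_e, self.codebook.weight)     # distance to every codebook entry
        indices = dists.argmin(dim=-1)                     # discrete latent codes
        z_q = self.codebook(indices)                       # quantised vectors
        # Codebook + commitment losses replace the KL term of a Gaussian VAE.
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() \
                  + self.beta * ((z_e - z_q.detach()) ** 2).mean()
        # Straight-through estimator: copy decoder gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices, vq_loss
```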
13. VAEs vs. Other Generative Models
| Property | VAE | Normalizing Flow | Autoregressive | GAN | Diffusion |
|---|---|---|---|---|---|
| Exact log-likelihood | No (ELBO) | Yes | Yes | No | No (bound) |
| Fast sampling | Yes | Yes | No (sequential) | Yes | No (steps) |
| Stable training | Yes | Yes | Yes | No | Yes |
| Structured latent space | Yes | Yes (exact) | No | No | Implicit |
| Invertible encoder | No | Yes | N/A | No | No |
| Sample quality (images) | Medium | Low–Medium | Medium | High | Highest |
VAEs occupy a unique niche: they provide a learned, structured latent space with a meaningful encoder — something that flows, GANs, and diffusion models lack by default. This makes them the tool of choice when representation learning and inference (not just generation) matter.
Summary
| Concept | Key Idea |
|---|---|
| Generative story | $z \sim p(z) = \mathcal{N}(0, I)$, then $x \sim p_\theta(x \mid z)$ |
| Intractable posterior | $p_\theta(z \mid x)$ cannot be computed exactly — approximate with $q_\phi(z \mid x)$ |
| ELBO | $\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x) \,\Vert\, p(z))$ — a tractable lower bound on $\log p_\theta(x)$ |
| Reconstruction term | Cross-entropy or MSE — pushes for faithful decoding |
| KL term | Closed-form Gaussian expression — regularises the latent space |
| Reparameterisation | $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ — makes sampling differentiable |
| Posterior collapse | Encoder ignored; mitigated by KL annealing or free bits |
| β-VAE | Higher KL weight → disentangled latent dimensions |
| VQ-VAE | Discrete codebook + autoregressive prior → sharper samples |