
Variational Autoencoders

How VAEs learn structured latent spaces by combining neural networks with Bayesian inference — from the ELBO to the reparameterisation trick and beyond.


A Variational Autoencoder (VAE) is a generative model that learns a compressed, structured representation of data — called a latent space — and can generate new data by sampling from it. Introduced by Kingma & Welling (2013) and Rezende, Mohamed & Wierstra (2014), it remains one of the most influential ideas in deep generative modelling.

What makes a VAE different from a standard autoencoder is a single, powerful idea: instead of mapping each data point to a single latent vector, the encoder maps it to a probability distribution over the latent space. This forces the latent space to be continuous and structured — making it possible to generate new samples simply by sampling a latent code and decoding it.


1. Starting Point: The Standard Autoencoder

Before understanding what makes VAEs variational, it helps to understand a plain autoencoder.

An autoencoder has two parts:

  • Encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$: compresses input $\mathbf{x}$ into a low-dimensional code $\mathbf{z}$.
  • Decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$: reconstructs $\mathbf{x}$ from $\mathbf{z}$.

The training objective is simply reconstruction — minimise the difference between the input and its reconstruction.
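To make the contrast concrete, here is a minimal sketch of a plain autoencoder in PyTorch. The layer sizes and the 784-dimensional (MNIST-style) input are illustrative assumptions, not details from the text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A plain autoencoder: x -> z -> x_hat, trained purely on reconstruction."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),            # deterministic code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Training objective: plain reconstruction, with no constraint on the latent space.
model = Autoencoder()
x = torch.rand(16, 784)                      # a dummy minibatch
loss = nn.functional.mse_loss(model(x), x)   # minimise ||x - x_hat||^2
loss.backward()
```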

The Problem with Plain Autoencoders

A standard autoencoder maps each data point to a single point in latent space. The latent codes are scattered across $\mathbb{R}^d$ with no imposed structure. If you sample a random point $\mathbf{z}$ from the latent space and decode it, the decoder has no idea what to do — it was never trained to handle points between the learned codes.

You cannot use a plain autoencoder as a generative model.

[Figure: standard autoencoder vs. VAE. Discrete codes: gap = garbage. Continuous distributions: overlap = structure.]

2. The Generative View

A VAE starts from a probabilistic generative story:

  1. Sample a latent code from the prior: $\mathbf{z} \sim p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$
  2. Decode it into data via the likelihood: $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$

The model is parameterised by $\theta$ (the decoder network). We want to find $\theta$ that maximises the marginal likelihood of the data:

$$p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$

This integral is intractable — integrating over all possible latent codes is computationally infeasible for high-dimensional $\mathbf{z}$.

Why Is This Integral Hard?

For a 100-dimensional latent space, the integral is over $\mathbb{R}^{100}$. There is no closed form, and numerical quadrature is exponentially expensive in dimension. The VAE sidesteps this with approximate posterior inference.


3. Approximate Inference: The Encoder

The key insight of VAEs: instead of integrating over all $\mathbf{z}$, introduce a learned approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ — the encoder — that tries to match the true (intractable) posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$.

In practice, the encoder is a neural network that outputs the parameters of a diagonal Gaussian:

$$q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\left(\mathbf{z} \mid \boldsymbol{\mu}_\phi(\mathbf{x}),\, \text{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\right)$$

The encoder takes $\mathbf{x}$ and outputs two vectors: a mean $\boldsymbol{\mu}$ and a log-variance $\log \boldsymbol{\sigma}^2$ (the log is used for numerical stability), both of dimension $d_z$.
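A minimal sketch of such an encoder in PyTorch, with illustrative layer sizes; the two output heads produce $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}^2$.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder q_phi(z | x): outputs the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, input_dim: int = 784, hidden_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mu_phi(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log sigma^2_phi(x)

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)
```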

[Figure: the VAE pipeline: 1. Input (x) → 2. Encode (μ, σ) → 3. Sample (z) → 4. Decode (x̂).]

4. The Evidence Lower Bound (ELBO)

Since $\log p_\theta(\mathbf{x})$ is intractable, we derive a tractable lower bound using Jensen's inequality.

Starting from the log-likelihood:

$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z}$$

Multiply and divide by $q_\phi(\mathbf{z} \mid \mathbf{x})$:

$$= \log \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \frac{p_\theta(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})} \right]$$

Apply Jensen's inequality ($\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$):

$$\geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \right] - D_\text{KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \right)$$

This lower bound is the ELBO (Evidence Lower BOund):

$$\boxed{\mathcal{L}(\theta, \phi;\, \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \right]}_{\text{Reconstruction term}} - \underbrace{D_\text{KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \right)}_{\text{Regularisation term}}}$$

Training maximises $\mathcal{L}$ with respect to both $\theta$ (decoder) and $\phi$ (encoder).
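As a concrete, hedged sketch, the negative ELBO for a Bernoulli decoder can be computed like this, using a single Monte Carlo sample of $\mathbf{z}$ for the reconstruction term and the closed-form Gaussian KL given in Section 5:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_logits, mu, logvar):
    """-ELBO for a Bernoulli decoder: reconstruction NLL + closed-form Gaussian KL.

    x          : (batch, d) targets in [0, 1]
    x_logits   : (batch, d) decoder outputs before the sigmoid
    mu, logvar : (batch, d_z) encoder outputs
    """
    # E_q[log p_theta(x | z)], estimated with the one sample of z used to produce x_logits
    recon_nll = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims and the batch
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return (recon_nll + kl) / x.shape[0]   # mean per data point
```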

The Gap

The ELBO is always a lower bound on $\log p_\theta(\mathbf{x})$. The gap between them is exactly the KL divergence between the approximate and true posterior:

$$\log p_\theta(\mathbf{x}) = \mathcal{L}(\theta, \phi;\, \mathbf{x}) + D_\text{KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}) \right)$$

The better the encoder approximates the true posterior, the tighter the bound.


5. Dissecting the Two ELBO Terms

Reconstruction Term

$$\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}) \right]$$

This measures how well the decoder reconstructs the input from the latent code. Concretely:

  • For binary data (e.g., binarised MNIST): $p_\theta(\mathbf{x} \mid \mathbf{z})$ is Bernoulli → reconstruction loss is binary cross-entropy.
  • For continuous data (e.g., natural images): $p_\theta(\mathbf{x} \mid \mathbf{z})$ is Gaussian → reconstruction loss is mean squared error.

This term pushes the model to encode and decode data faithfully.

Regularisation Term (KL Divergence)

$$D_\text{KL}\!\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \right)$$

This measures how far the learned posterior is from the standard Gaussian prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$.

For diagonal Gaussians, this term has a closed-form solution that requires no Monte Carlo estimation:

$$D_\text{KL}\!\left( \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I}) \right) = \frac{1}{2} \sum_{j=1}^{d_z} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$

This term acts as a regulariser: it prevents the encoder from mapping all data points to extremely narrow, non-overlapping Gaussians (which would turn the VAE back into a plain autoencoder with no usable latent space).

[Interactive figure: loss tension in the β-VAE. Low β: prioritises detail, messy latent space. High β: prioritises structure, blurry reconstructions.]

6. The KL Divergence in Detail

It helps to build intuition for what $D_\text{KL}(q \| p)$ is doing geometrically.

$$D_\text{KL}(q \| p) = \mathbb{E}_q \left[ \log \frac{q(\mathbf{z})}{p(\mathbf{z})} \right] \geq 0$$

It is zero if and only if $q = p$ everywhere. For our diagonal Gaussian case:

| What KL penalises | Why |
| --- | --- |
| Large $\mu_j$ (far from origin) | Pulls all posterior means toward 0 |
| Very small $\sigma_j$ (near-delta posterior) | Prevents the encoder from being over-confident |
| Very large $\sigma_j$ | Prevents the posterior from being too diffuse |

The KL term therefore enforces that the approximate posterior stays close to the prior — a standard Gaussian centred at the origin. This is what creates the smooth, connected latent space that enables generation.
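A quick numerical sanity check of these claims, using the closed-form expression above (the helper is just that formula; the numbers are illustrative):

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Per-example D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=-1)

mu = torch.zeros(1, 8)
logvar = torch.zeros(1, 8)            # sigma = 1 everywhere
print(gaussian_kl(mu, logvar))        # tensor([0.]): matching the prior costs nothing

print(gaussian_kl(mu + 2.0, logvar))  # tensor([16.]): with sigma = 1 the cost is 0.5 * sum(mu^2)
```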


7. The Reparameterisation Trick

Here lies a critical technical challenge: the ELBO requires computing gradients with respect to $\phi$ (the encoder parameters), but the reconstruction term involves an expectation under $q_\phi(\mathbf{z} \mid \mathbf{x})$ — a distribution that itself depends on $\phi$.

Sampling is not differentiable. You cannot backpropagate through a sampling operation directly.

The Solution

Write the sample $\mathbf{z}$ as a deterministic function of $\phi$ and a noise variable $\boldsymbol{\epsilon}$ that is independent of $\phi$:

$$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Now the randomness lives entirely in $\boldsymbol{\epsilon}$, which does not depend on $\phi$. The gradient $\frac{\partial \mathbf{z}}{\partial \phi}$ flows cleanly through $\boldsymbol{\mu}_\phi$ and $\boldsymbol{\sigma}_\phi$.

$$\underbrace{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})}_{\text{not differentiable}} \quad \Longrightarrow \quad \underbrace{\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}}_{\text{differentiable w.r.t. } \phi}$$
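In PyTorch the trick is a few lines; a sketch, assuming `mu` and `logvar` come from an encoder like the one sketched in Section 3:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(log sigma^2 / 2)
    eps = torch.randn_like(std)     # randomness isolated in eps, independent of phi
    return mu + std * eps

# Gradients flow from z back into mu and logvar:
mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
print(mu.grad.shape, logvar.grad.shape)   # both (4, 8): the path is differentiable
```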
[Figure: without reparameterisation, the gradient from the decoder (θ) is blocked at the sampling node; with reparameterisation, μ and σ are deterministic outputs of the encoder (φ) and the static noise ε enters separately, so gradients flow.]

8. The Full Training Pipeline

Putting it all together, one training step for a single data point $\mathbf{x}$:

  1. Encode — feed $\mathbf{x}$ through the encoder network to get $\boldsymbol{\mu}_\phi, \boldsymbol{\sigma}_\phi$.
  2. Sample — draw $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and compute $\mathbf{z} = \boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\epsilon}$.
  3. Decode — feed $\mathbf{z}$ through the decoder network to get the reconstruction $\hat{\mathbf{x}}$.
  4. Compute ELBO — evaluate reconstruction loss + KL divergence.
  5. Backpropagate — gradients flow through $\hat{\mathbf{x}}$, $\mathbf{z}$, $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, back into encoder and decoder weights.

In minibatch training, the ELBO for a dataset $\mathcal{D}$ is approximated as:

$$\mathcal{L} \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \underbrace{\log p_\theta(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i)})}_{\text{reconstruction}} - \underbrace{D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}^{(i)}) \,\|\, p(\mathbf{z}))}_{\text{closed form}} \right]$$

where $\mathbf{z}^{(i)}$ is sampled once per data point per step using the reparameterisation trick.
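A hedged end-to-end sketch of one such training step, reusing the hypothetical `GaussianEncoder`, `reparameterize`, and `negative_elbo` helpers sketched earlier, plus an assumed decoder that outputs Bernoulli logits:

```python
import torch
import torch.nn as nn

# Assumed components (see earlier sketches); layer sizes are illustrative.
encoder = GaussianEncoder(input_dim=784, latent_dim=32)
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x: torch.Tensor) -> float:
    optimizer.zero_grad()
    mu, logvar = encoder(x)                           # 1. encode
    z = reparameterize(mu, logvar)                    # 2. sample via the reparameterisation trick
    x_logits = decoder(z)                             # 3. decode
    loss = negative_elbo(x, x_logits, mu, logvar)     # 4. -ELBO = reconstruction NLL + KL
    loss.backward()                                   # 5. gradients reach encoder and decoder
    optimizer.step()
    return loss.item()

x = torch.rand(64, 784)   # dummy minibatch
print(train_step(x))
```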


9. The Latent Space

The latent space is the heart of the VAE. Because the KL term regularises all posterior distributions toward the standard Gaussian prior, the latent space tends to have three desirable properties:

Continuity

Nearby points in latent space decode to similar outputs. Small perturbations to $\mathbf{z}$ produce small changes in $\hat{\mathbf{x}}$.

Completeness

Every point sampled from $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ decodes to a plausible output. The decoder is trained to handle the entire region near the origin, not just isolated point-codes.

Interpolation

Walking a straight line between two latent codes $\mathbf{z}_A$ and $\mathbf{z}_B$ produces a semantically meaningful interpolation between their corresponding data points.

$$\mathbf{z}(\lambda) = (1 - \lambda) \mathbf{z}_A + \lambda \mathbf{z}_B, \quad \lambda \in [0, 1]$$
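As an illustration, a linear latent interpolation might be implemented as follows (a sketch; `encoder` and `decoder` are the hypothetical networks from the training sketch, and the posterior means are used as endpoints):

```python
import torch

@torch.no_grad()
def interpolate(x_a: torch.Tensor, x_b: torch.Tensor, steps: int = 8) -> torch.Tensor:
    """Decode points on the straight line between the latent means of x_a and x_b."""
    z_a, _ = encoder(x_a)                             # posterior mean of x_a, shape (1, d_z)
    z_b, _ = encoder(x_b)                             # posterior mean of x_b
    lambdas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z_path = (1 - lambdas) * z_a + lambdas * z_b      # z(lambda), shape (steps, d_z)
    return torch.sigmoid(decoder(z_path))             # decoded frames along the walk
```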
[Interactive figure: a 2D latent space (z) alongside the decoded output (x̂); moving through the latent space morphs the sample.]

10. Generation

To generate a new sample:

  1. Sample a latent code from the prior: $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
  2. Decode: $\hat{\mathbf{x}} = \text{decoder}_\theta(\mathbf{z})$

No encoder is needed at generation time. This is the key advantage over autoregressive models, where generation is sequential — a VAE decoder can generate an entire output in a single forward pass.
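Generation is then a two-line procedure; a sketch assuming the same hypothetical decoder and a 32-dimensional latent space:

```python
import torch

@torch.no_grad()
def generate(n_samples: int = 16, latent_dim: int = 32) -> torch.Tensor:
    z = torch.randn(n_samples, latent_dim)   # 1. z ~ N(0, I)
    return torch.sigmoid(decoder(z))         # 2. decode in a single forward pass
```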


11. Posterior Collapse

A notorious failure mode of VAEs is posterior collapse: the encoder ignores the input and maps every $\mathbf{x}$ to the prior, $q_\phi(\mathbf{z} \mid \mathbf{x}) \approx p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$.

When this happens:

  • The KL term drops to zero (the encoder is "perfect" from the regulariser's perspective).
  • The decoder learns to ignore $\mathbf{z}$ entirely and generates data from the marginal distribution.
  • The latent space carries no information about the input.

Why It Happens

The KL term always "wants" $q = p$ (zero KL is optimal). If the decoder is powerful enough (e.g., a Transformer or PixelCNN), it can achieve good reconstruction without any information from $\mathbf{z}$, making the encoder redundant.

Mitigations

| Technique | Mechanism |
| --- | --- |
| KL annealing | Start with β = 0 and slowly increase it to 1 during training |
| Free bits | Only penalise the KL in a dimension once it exceeds a threshold $\lambda$; KL below $\lambda$ is "free" |
| Weakened decoder | Deliberately limit decoder capacity so it must use $\mathbf{z}$ |
| δ-VAE | Ensure a minimum KL per latent dimension |
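KL annealing and free bits only change how the KL term enters the loss. A hedged sketch, with an arbitrary schedule length and threshold:

```python
import torch

def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """KL annealing: ramp beta linearly from 0 to 1 over the first warmup_steps."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Free bits: each latent dimension gets `lam` nats 'for free' before being penalised."""
    return torch.clamp(kl_per_dim, min=lam).sum(dim=-1)

# In the training step the loss would become something like:
#   loss = recon_nll + kl_weight(step) * free_bits_kl(kl_per_dim).sum()
```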
[Figure: latent encodings (the encoder separates inputs into distinct regions) and reconstructions (the decoder uses the latent code to reconstruct specific details).]


12. Beyond the Vanilla VAE

β-VAE: Disentanglement

$$\mathcal{L}_\beta = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - \beta \cdot D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})), \quad \beta > 1$$

A higher β forces the posterior to be even closer to the factorial prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$, encouraging each latent dimension to encode an independent, disentangled factor of variation (shape, colour, orientation, etc.).
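In code, the only change from the vanilla objective is a multiplier on the KL term; a sketch mirroring the hypothetical `negative_elbo` above:

```python
import torch
import torch.nn.functional as F

def beta_negative_elbo(x, x_logits, mu, logvar, beta: float = 4.0):
    """-L_beta: reconstruction NLL + beta * KL. beta = 1 recovers the vanilla VAE."""
    recon_nll = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return (recon_nll + beta * kl) / x.shape[0]
```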

Latent Traversals & Disentanglement

Varying $z_j$ from -3 to +3 while fixing other dimensions at 0.

[Interactive figure: traversals of three latent dimensions from -3 to +3 under adjustable KL pressure β, from entangled (β = 1, vanilla) to disentangled (high β).]
Entanglement (Low β)

Dimensions are mixed. Notice how changing "Dimension 1" might affect both size and colour. This makes the latent space harder to interpret.

Disentangled (High β)

Each dimension controls a single isolated feature. "Dimension 2" now only controls the hue, while shape and size remain constant.


Hierarchical VAEs

Standard VAEs use a single level of latent variables. Hierarchical VAEs (e.g., NVAE, VDVAE) stack multiple layers of stochastic variables:

$$p_\theta(\mathbf{x}, \mathbf{z}_1, \dots, \mathbf{z}_L) = p_\theta(\mathbf{x} \mid \mathbf{z}_1) \left[ \prod_{l=1}^{L-1} p_\theta(\mathbf{z}_l \mid \mathbf{z}_{l+1}) \right] p(\mathbf{z}_L)$$

Lower layers capture fine-grained local details; higher layers capture global structure. This hierarchy dramatically improves sample quality.


VQ-VAE: Discrete Latent Spaces

VQ-VAE (van den Oord et al., 2017) replaces the continuous Gaussian posterior with a discrete codebook. The encoder maps $\mathbf{x}$ to a sequence of indices into a learned embedding table; the decoder takes the corresponding embedding vectors.

The key insight: there is no KL term — the prior is a uniform categorical distribution. Instead, a separate autoregressive prior (e.g., a PixelCNN or Transformer) is trained on the sequence of codebook indices. This decouples representation learning from generation and produces sharper samples than a Gaussian decoder.
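A hedged sketch of the core codebook lookup (nearest-neighbour quantisation with a straight-through gradient; the codebook size, dimensions, and the omission of the codebook/commitment losses are simplifications):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Map each encoder vector to its nearest codebook entry (straight-through gradient)."""
    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, code_dim) continuous encoder outputs
        distances = torch.cdist(z_e, self.codebook.weight)   # (batch, num_codes)
        indices = distances.argmin(dim=-1)                    # discrete latent codes
        z_q = self.codebook(indices)                          # quantised vectors
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices
```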

VQ-VAE is the foundation of DALL-E 1, MusicLM, and many audio tokenisation systems.


13. VAEs vs. Other Generative Models

| Property | VAE | Normalizing Flow | Autoregressive | GAN | Diffusion |
| --- | --- | --- | --- | --- | --- |
| Exact log-likelihood | No (ELBO) | Yes | Yes | No | No (bound) |
| Fast sampling | Yes | Yes | No (sequential) | Yes | No (steps) |
| Stable training | Yes | Yes | Yes | No | Yes |
| Structured latent space | Yes | Yes (exact) | No | No | Implicit |
| Invertible encoder | No | Yes | N/A | No | No |
| Sample quality (images) | Medium | Low–Medium | Medium | High | Highest |

VAEs occupy a unique niche: they provide a learned, structured latent space with a meaningful encoder — something that flows, GANs, and diffusion models lack by default. This makes them the tool of choice when representation learning and inference (not just generation) matter.


Summary

| Concept | Key Idea |
| --- | --- |
| Generative story | $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$ |
| Intractable posterior | $p_\theta(\mathbf{z} \mid \mathbf{x})$ cannot be computed exactly — approximate with $q_\phi$ |
| ELBO | $\mathbb{E}_q[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_\text{KL}(q_\phi \| p)$ — a tractable lower bound |
| Reconstruction term | Cross-entropy or MSE — pushes for faithful decoding |
| KL term | Closed-form Gaussian expression — regularises the latent space |
| Reparameterisation | $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ — makes sampling differentiable |
| Posterior collapse | Encoder ignored; mitigated by KL annealing or free bits |
| β-VAE | Higher KL weight → disentangled latent dimensions |
| VQ-VAE | Discrete codebook + autoregressive prior → sharper samples |
