Normalizing Flows
How invertible neural networks learn exact probability distributions — from the change-of-variables formula to modern coupling layers and continuous flows.
Imagine you have a simple, well-understood probability distribution — a standard Gaussian — and you want to transform it into something far more complex: the distribution of natural images, or the distribution of English sentences. Normalizing flows do exactly this, through a chain of invertible, differentiable transformations.
What makes flows special is their exactness. Unlike VAEs (which optimise a lower bound) or GANs (which use an implicit adversarial objective), normalizing flows give you the exact log-likelihood of any data point. You can both generate samples and evaluate precise probabilities — in both directions.
1. The Core Intuition
Start with a simple "base" distribution — typically an isotropic Gaussian — and apply a sequence of invertible transformations $f_1, \dots, f_K$ to produce samples in data space:

$$x = f_K \circ f_{K-1} \circ \cdots \circ f_1(z), \qquad z \sim \mathcal{N}(0, I)$$
Because each $f_k$ is invertible, you can run the chain in reverse: given a data point $x$, recover the latent code $z = f_1^{-1} \circ \cdots \circ f_K^{-1}(x)$ exactly.
The word normalizing refers to this reverse direction — mapping complex data back to a "normal" (Gaussian) form. The word flow refers to the sequence of transformations.
Figure: Flow Transformation — invertible steps transform a Gaussian base into a target shape; each step maps points deterministically.
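To make the round trip concrete, here is a minimal numpy sketch; the two transformations are illustrative choices, not from any particular model:

```python
import numpy as np

# A minimal two-step flow built from hand-picked invertible maps.
forward = [
    lambda z: z * np.exp(0.5) + 1.0,    # affine: scale, then shift
    lambda z: np.cbrt(z),               # cube root: invertible on all of R
]
inverse = [
    lambda x: x ** 3,                   # undo the cube root
    lambda x: (x - 1.0) * np.exp(-0.5), # undo the affine map
]

z = np.random.randn(5)                  # sample from the Gaussian base
x = z
for f in forward:                       # generation direction: z -> x
    x = f(x)

z_rec = x
for g in inverse:                       # inference direction: x -> z
    z_rec = g(z_rec)

print(np.allclose(z, z_rec))            # True: the latent code is recovered exactly
```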
2. The Change-of-Variables Formula
The mathematical engine behind flows is the change-of-variables formula from calculus. If $x = f(z)$ and $f$ is a differentiable bijection, then the probability densities are related by:

$$p_X(x) = p_Z\big(f^{-1}(x)\big)\,\left| \det \frac{\partial f^{-1}}{\partial x} \right|$$

Or equivalently, writing $z = f^{-1}(x)$:

$$\log p_X(x) = \log p_Z(z) + \log \left| \det \frac{\partial f^{-1}}{\partial x} \right|$$
| Term | Meaning |
|---|---|
| $p_Z(z)$ | Likelihood of the latent code under the base distribution |
| $\partial f^{-1} / \partial x$ | Jacobian matrix of the inverse transformation |
| $\lvert \det \cdot \rvert$ | Volume scaling factor — how much the transformation stretches or squishes space |
The Volume Correction
Think of probability mass like water. If a transformation compresses a region of space, the density increases there (more water in a smaller volume). If it expands a region, density decreases. The Jacobian determinant is precisely this volume-change correction factor.
For a Chain of Flows
When transformations $f_1, \dots, f_K$ are composed, the log-likelihood accumulates one log-Jacobian term per layer:

$$\log p_X(x) = \log p_Z(z_0) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k^{-1}}{\partial z_k} \right|$$

where $z_K = x$, $z_{k-1} = f_k^{-1}(z_k)$, and $z_0 = z$.

This is why the transformations must be invertible: we need to map data back to the base space and evaluate $\log p_Z(z_0)$.
Worked example (base space to warped data space): the map $f(z_1, z_2) = \big(z_1,\; z_2 \cdot e^{0.8 z_1 + 0.5}\big)$ has Jacobian determinant $\det J = e^{0.8 z_1 + 0.5}$ — exactly the local volume-scaling factor.
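The worked example above can be checked numerically. A small sketch (the probe point is arbitrary) compares the analytic determinant against a finite-difference Jacobian and evaluates the exact density:

```python
import numpy as np

def f(z):
    z1, z2 = z
    return np.array([z1, z2 * np.exp(0.8 * z1 + 0.5)])

z = np.array([0.3, -1.2])               # arbitrary probe point
eps = 1e-6

# Finite-difference Jacobian, one column per input dimension
J = np.column_stack([(f(z + eps * e) - f(z - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
print(np.linalg.det(J))                  # numerical determinant
print(np.exp(0.8 * z[0] + 0.5))          # analytic determinant: they agree

# Exact log-density of x = f(z) via the change-of-variables formula
log_pz = -0.5 * z @ z - np.log(2 * np.pi)       # standard 2-D Gaussian
log_px = log_pz - (0.8 * z[0] + 0.5)            # subtract log|det J|
```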
3. The Two Requirements for a Flow Layer
Every building block in a normalizing flow must satisfy exactly two constraints:
- Invertible: Given $x = f(z)$, we must be able to recover $z = f^{-1}(x)$ efficiently.
- Tractable Jacobian: We must be able to compute $\log \lvert \det J \rvert$ efficiently — ideally in $O(D)$ rather than $O(D^3)$ (the cost of a naive determinant for a $D \times D$ matrix).
These two constraints are what drive all of the architectural innovations in normalizing flows. Different flow families make different design choices to satisfy them cheaply.
4. Planar and Radial Flows (Beginner Building Blocks)
The earliest flows (Rezende & Mohamed, 2015) used simple parameterisations.
Planar Flow
$$f(z) = z + u\, h(w^\top z + b)$$

This pushes the distribution around the hyperplane defined by $w$ and $b$. The Jacobian determinant has a closed form via the matrix determinant lemma:

$$\det \frac{\partial f}{\partial z} = 1 + u^\top h'(w^\top z + b)\, w$$
Radial Flow
Contracts or expands the distribution around a reference point $z_0$:

$$f(z) = z + \beta\, h(\alpha, r)\,(z - z_0), \qquad r = \lVert z - z_0 \rVert, \quad h(\alpha, r) = \frac{1}{\alpha + r}$$
Limitation
Planar and radial flows are expressive in low dimensions but scale poorly — stacking many layers is required to approximate complex distributions in high dimensions.
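As a concrete sketch of a planar flow with $h = \tanh$ (parameter values below are arbitrary, and invertibility additionally requires $u^\top w \ge -1$):

```python
import numpy as np

def planar_flow(z, u, w, b):
    """f(z) = z + u * tanh(w.z + b), with closed-form log|det J|."""
    a = np.tanh(w @ z + b)
    f = z + u * a
    # matrix determinant lemma, with h'(x) = 1 - tanh(x)^2
    det = 1.0 + (1.0 - a ** 2) * (u @ w)
    return f, np.log(np.abs(det))

rng = np.random.default_rng(0)
z = rng.standard_normal(4)
u, w, b = rng.standard_normal(4), rng.standard_normal(4), 0.1
x, logdet = planar_flow(z, u, w, b)   # one layer's contribution to log p(x)
```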
5. Affine Coupling Layers (RealNVP)
The affine coupling layer (Dinh et al., 2017 — RealNVP) is the workhorse of practical normalizing flows. It achieves both invertibility and a cheap Jacobian by cleverly splitting the input dimensions.
The Split-and-Transform Trick
Partition $z \in \mathbb{R}^D$ into two halves: $z = (z_{1:d},\, z_{d+1:D})$.

Forward pass (generation direction $z \to y$):

$$y_{1:d} = z_{1:d}, \qquad y_{d+1:D} = z_{d+1:D} \odot \exp\big(s(z_{1:d})\big) + t(z_{1:d})$$

Inverse pass (inference direction $y \to z$):

$$z_{1:d} = y_{1:d}, \qquad z_{d+1:D} = \big(y_{d+1:D} - t(y_{1:d})\big) \odot \exp\big(-s(y_{1:d})\big)$$

where $s(\cdot)$ and $t(\cdot)$ are arbitrary neural networks (scale and translation functions).
Why the Jacobian Is Cheap
Because $y_{1:d} = z_{1:d}$ (the first half passes through unchanged), the Jacobian matrix is lower triangular. The determinant of a triangular matrix is just the product of its diagonal entries:

$$\log \left| \det \frac{\partial y}{\partial z} \right| = \sum_{j} s(z_{1:d})_j$$

This is $O(D)$ — no matrix operations needed. The expressive power comes entirely from $s$ and $t$, which can be deep networks.
Figure: Affine coupling layer, forward pass (z → y) — the input vector is partitioned into two halves; one half passes through unchanged while the other is scaled and shifted.
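A minimal sketch of the coupling layer, with tiny placeholder functions standing in for the scale and translation networks $s$ and $t$:

```python
import numpy as np

# Placeholder "networks": in practice s and t are deep nets mapping the
# first half to scales/shifts for the second half.
def s(h): return 0.5 * np.tanh(h)
def t(h): return 0.1 * h

def coupling_forward(z, d):
    z1, z2 = z[:d], z[d:]
    y2 = z2 * np.exp(s(z1)) + t(z1)      # transform only the second half
    logdet = np.sum(s(z1))               # sum of log-scales: O(D), no matrices
    return np.concatenate([z1, y2]), logdet

def coupling_inverse(y, d):
    y1, y2 = y[:d], y[d:]
    z2 = (y2 - t(y1)) * np.exp(-s(y1))   # exact inverse, same cost
    return np.concatenate([y1, z2])

z = np.random.randn(6)
y, logdet = coupling_forward(z, d=3)
print(np.allclose(z, coupling_inverse(y, d=3)))  # True
```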
6. Forward vs. Inverse Direction
A critical distinction in flow models is which direction is cheap to compute:
| Model | Generation ($z \to x$) | Inference / Density ($x \to z$) |
|---|---|---|
| RealNVP (coupling) | Parallel | Parallel |
| MAF (autoregressive) | $O(D)$ sequential | Parallel |
| IAF (inverse autoregressive) | Parallel | $O(D)$ sequential |
This trade-off is fundamental. Models optimised for fast density evaluation (MAF) are slow at sampling; models optimised for fast sampling (IAF) are slow at density evaluation.
Example: a coupling layer maps latent z = [0.200, −0.300] to data x = [0.200, −0.018] — the first coordinate passes through unchanged.
7. Autoregressive Flows
Autoregressive flows (Masked Autoregressive Flow — MAF, Papamakarios et al., 2017) apply the same chain-rule factorisation used in autoregressive generative models, but in the context of a flow.
The Autoregressive Structure
Each dimension of the data is an affine function of the corresponding latent variable, with scale and shift conditioned on all previous data dimensions:

$$x_i = z_i \cdot \exp\big(\alpha_i(x_{1:i-1})\big) + \mu_i(x_{1:i-1})$$
The Jacobian
Because $x_i$ depends only on $z_i$ and earlier dimensions (not later ones), the Jacobian is triangular:

$$\log \left| \det \frac{\partial x}{\partial z} \right| = \sum_{i=1}^{D} \alpha_i(x_{1:i-1})$$
Masking for Parallelism: MADE
In practice, the autoregressive conditioners are implemented using a single neural network with masked weights (MADE) — binary masks that zero out connections from input $x_j$ to the outputs for dimension $i$ whenever $j \ge i$. This enforces the autoregressive constraint while allowing a single forward pass to compute all conditioners simultaneously.
Sampling, by contrast, is sequential, with $O(D)$ complexity: each $x_i$ must wait for $x_{1:i-1}$ to be computed before it can be generated.
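A sketch of how MADE-style masks can be constructed for a one-hidden-layer conditioner; the degree-assignment scheme follows the MADE recipe, with sizes chosen arbitrarily:

```python
import numpy as np

# MADE-style masks for a one-hidden-layer conditioner (sizes arbitrary).
D, H = 4, 8
rng = np.random.default_rng(0)
m_in = np.arange(1, D + 1)                   # input degrees 1..D
m_hid = rng.integers(1, D, size=H)           # hidden degrees in 1..D-1

# Input -> hidden: allow a connection when the hidden degree >= input degree.
mask_in_to_hid = (m_hid[:, None] >= m_in[None, :]).astype(float)   # (H, D)
# Hidden -> output: output for dimension i only sees hidden units of degree < i.
mask_hid_to_out = (m_in[:, None] > m_hid[None, :]).astype(float)   # (D, H)

# Composing the two masks, output i depends only on inputs j < i, so the
# conditioner's Jacobian with respect to its inputs is strictly triangular.
```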
8. Stacking Flow Layers
A single flow layer has limited capacity. Power comes from composing many layers; the log-likelihood simply accumulates:

$$\log p_X(x) = \log p_Z(z) + \sum_{k=1}^{K} \log \left| \det J_k \right|$$
Between coupling layers, flows typically apply permutations or 1×1 convolutions (Glow, Kingma & Dhariwal, 2018) to mix the dimensions that were held fixed by the previous layer — otherwise, the same dimensions would always be left unchanged.
The Glow Architecture (Image Flows)
Glow extended RealNVP to images with three key additions:
- Actnorm — a per-channel affine layer with data-dependent initialisation; an alternative to batch normalisation that stays stable with small batches.
- Invertible 1×1 convolution — a learned permutation parameterised as an invertible matrix, replacing fixed channel shuffles.
- Multi-scale architecture — after every few flow steps, half the channels are factored out directly (split off to a Gaussian), reducing computational cost at deeper layers.
Figure: Composition stack — a sequence of Glow blocks (Actnorm → 1×1 Conv → Coupling), each contributing its own log|det J| term (e.g. +1.025, +1.090, +0.992, +0.667) to the total log-likelihood.
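A small numpy sketch of the invertible 1×1 convolution's determinant structure: because the same $C \times C$ matrix acts at every spatial position, the total log-determinant is just $H \cdot W \cdot \log \lvert \det W \rvert$ (matrix and sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 8, 8
Wmat = rng.standard_normal((C, C)) + 2 * np.eye(C)   # a (likely) invertible matrix

x = rng.standard_normal((C, H, W))
y = np.einsum('ij,jhw->ihw', Wmat, x)                # 1x1 conv = per-pixel matmul

sign, logabsdet = np.linalg.slogdet(Wmat)
logdet_total = H * W * logabsdet                     # one det(W) factor per pixel

x_rec = np.einsum('ij,jhw->ihw', np.linalg.inv(Wmat), y)
print(np.allclose(x, x_rec))                         # True: exactly invertible
```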
9. Continuous Normalizing Flows
Continuous Normalizing Flows (CNFs, Chen et al., 2018) take the limit of infinitely many infinitesimally thin flow layers. Rather than a discrete sequence of transformations, the latent variable evolves according to an ordinary differential equation (ODE):

$$\frac{dz(t)}{dt} = f_\theta\big(z(t), t\big)$$

Starting from $z(0) = z \sim p_Z$, we solve this ODE forward in time to reach $x = z(1)$.
The Continuous Change-of-Variables Formula
The log-density evolves according to the instantaneous change-of-variables formula (Liouville's theorem):

$$\frac{\partial \log p\big(z(t)\big)}{\partial t} = -\operatorname{tr}\!\left(\frac{\partial f_\theta}{\partial z(t)}\right)$$

The Jacobian trace (sum of diagonal entries) replaces the full log-determinant, and it can be estimated cheaply using the Hutchinson estimator:

$$\operatorname{tr}(A) = \mathbb{E}_{\epsilon}\big[\epsilon^\top A\, \epsilon\big], \qquad \mathbb{E}\big[\epsilon \epsilon^\top\big] = I$$
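A quick numerical illustration of the Hutchinson estimator on an explicit matrix (in a CNF, the matrix products below would be vector-Jacobian products, so the Jacobian is never formed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))                 # stands in for the Jacobian

n = 10_000
eps = rng.choice([-1.0, 1.0], size=(n, 50))       # Rademacher probe vectors
estimate = np.mean(np.einsum('ni,ij,nj->n', eps, A, eps))
print(estimate, np.trace(A))                      # close for large n
```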
Flow Matching: Training Without ODE Solvers
A major recent advance (Flow Matching, Lipman et al., 2022; Liu et al., 2022) sidesteps the expensive ODE integration during training entirely. Instead of maximising likelihood, the model is trained to directly regress a target vector field $u_t$ that transports the base distribution to the data distribution:

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, x \sim p_t}\, \big\lVert v_\theta(x, t) - u_t(x) \big\rVert^2$$

This is a simple regression loss — no likelihood, no Jacobian, just matching a vector field. Stable Diffusion 3 and many modern image models use flow matching.
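A minimal flow-matching training sketch, assuming straight-line conditional paths $x_t = (1 - t)\,x_0 + t\,x_1$ with target velocity $x_1 - x_0$ (the rectified-flow special case); the tiny MLP and toy data are placeholders:

```python
import torch
import torch.nn as nn

dim = 2
v = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(256, dim) * 0.3 + 2.0       # stand-in "data" samples
    x0 = torch.randn(256, dim)                   # base (Gaussian) samples
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                   # point on the straight path
    target = x1 - x0                             # that path's velocity
    loss = ((v(torch.cat([xt, t], dim=1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate dz/dt = v(z, t) from t = 0 to t = 1 with Euler steps.
z = torch.randn(16, dim)
with torch.no_grad():
    for k in range(100):
        t = torch.full((16, 1), k / 100.0)
        z = z + v(torch.cat([z, t], dim=1)) / 100
```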
CNF vs. Discrete Flows
CNFs are more flexible (the ODE can take any path through space) but slower at training and inference due to ODE integration. Discrete flows are faster but constrained to a fixed architecture. Flow matching closes this gap by training CNFs with a simple, stable regression objective.
10. Training Objective
For discrete flows, training maximises the exact log-likelihood over a dataset $\{x^{(n)}\}_{n=1}^{N}$:

$$\max_\theta\; \sum_{n=1}^{N} \left[ \log p_Z\big(f_\theta^{-1}(x^{(n)})\big) + \log \left| \det J_{f_\theta^{-1}}\big(x^{(n)}\big) \right| \right]$$

where $f_\theta^{-1}$ is the inference direction (data → latent). Everything is differentiable with respect to $\theta$, so standard gradient ascent applies.
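As a sketch of this objective in code, assuming a hypothetical flow object exposing the inference direction as `flow.inverse(x)` (returning the latent and the per-sample log-determinant; this interface is an assumption, not a specific library's API):

```python
import math
import torch

def nll_loss(flow, x):
    # `flow.inverse` is an assumed interface: data -> latent plus the
    # per-sample log|det J| of the inverse transformation.
    z, log_det = flow.inverse(x)
    # log-density of z under a standard Gaussian base
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.shape[1] * math.log(2 * math.pi)
    log_px = log_pz + log_det              # change of variables
    return -log_px.mean()                  # minimise NLL = maximise likelihood
```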
Why Exact Likelihood Matters
| Model | Likelihood | What's optimised |
|---|---|---|
| VAE | Lower bound (ELBO) | ELBO |
| GAN | Implicit — not evaluated | Adversarial minimax |
| Diffusion | Weighted lower bound | Denoising (noise-prediction) loss |
| Flow | Exact | True log-likelihood |
Exact likelihood enables direct comparison between models (in bits-per-dimension), reliable density estimation, and use as a prior in larger systems.
11. Applications
Density Estimation
Flows directly answer: "How likely is this data point?" This is valuable for anomaly detection (flag samples with very low $\log p(x)$) and scientific applications where calibrated probabilities matter.
Generative Modelling
Sample $z \sim \mathcal{N}(0, I)$, then run the forward pass $x = f(z)$. Used in image generation (Glow), audio synthesis (WaveGlow), and molecular design.
Variational Inference
Use a flow as the variational posterior $q_\phi(z \mid x)$ in a VAE. The flow makes $q_\phi$ more expressive than a diagonal Gaussian without losing tractability — a technique known as normalizing flows for variational inference.
Latent Space Manipulation
Because the mapping between $x$ and $z$ is exact and invertible, you can encode a real image to $z$, manipulate $z$ (interpolate, add attribute vectors), then decode back. Unlike VAE-based approaches, there is no approximation error.
12. Comparison with Other Generative Models
| Property | Autoregressive | VAE | GAN | Diffusion | Flow |
|---|---|---|---|---|---|
| Exact log-likelihood | Yes | No (ELBO) | No | No (bound) | Yes |
| Fast sampling | No (sequential) | Yes | Yes | No (steps) | Yes |
| Stable training | Yes | Yes | No | Yes | Yes |
| Latent space | Implicit | Approx. | Implicit | Noisy | Exact |
| Memory cost | Low | Low | Low | Medium | High |
| Best known use | Text (GPT) | Representation | Images | Images | Density |
The main cost of flows is memory and compute: storing the full invertible network with its Jacobian terms is expensive. This is why flows have been largely supplanted by diffusion models for image generation, where scalability matters more than exact likelihood.
Summary
| Concept | Key Idea |
|---|---|
| Change-of-variables | $\log p_X(x) = \log p_Z(z) + \log \lvert \det \partial f^{-1} / \partial x \rvert$ |
| Coupling layer | Split dimensions; transform half with an unconstrained network; Jacobian is triangular |
| Autoregressive flow | Each dimension conditioned on previous; parallel inference, sequential sampling |
| Continuous flow | ODE dynamics; log-density from Jacobian trace instead of determinant |
| Flow matching | Regress a target vector field — no ODE solve during training |
| Exact likelihood | True $\log p(x)$, not a bound — enables density estimation and model comparison |