
Normalizing Flows

How invertible neural networks learn exact probability distributions — from the change-of-variables formula to modern coupling layers and continuous flows.


Imagine you have a simple, well-understood probability distribution — a standard Gaussian — and you want to transform it into something far more complex: the distribution of natural images, or the distribution of English sentences. Normalizing flows do exactly this, through a chain of invertible, differentiable transformations.

What makes flows special is their exactness. Unlike VAEs (which optimise a lower bound) or GANs (which use an implicit adversarial objective), normalizing flows give you the exact log-likelihood of any data point. The same model can both generate samples and evaluate exact probabilities, depending on which direction you run it.


1. The Core Intuition

Start with a simple "base" distribution, typically an isotropic Gaussian $p_Z(\mathbf{z})$, and apply a sequence of invertible transformations $f_1, f_2, \dots, f_K$ to produce samples in data space:

$$\mathbf{z} \sim p_Z(\mathbf{z}), \qquad \mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$

Because each $f_k$ is invertible, you can run the chain in reverse: given a data point $\mathbf{x}$, recover the latent code $\mathbf{z}$ exactly.

The word normalizing refers to this reverse direction — mapping complex data back to a "normal" (Gaussian) form. The word flow refers to the sequence of transformations.

[Interactive figure: Flow Transformation. Invertible steps transform a Gaussian base $\mathbf{z} \sim \mathcal{N}(0, I)$ into a target shape; each step maps points deterministically.]

2. The Change-of-Variables Formula

The mathematical engine behind flows is the change-of-variables formula from calculus. If $\mathbf{x} = f(\mathbf{z})$ and $f$ is a differentiable bijection, then the probability densities are related by:

$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \cdot \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right|$$

Or equivalently, writing $\mathbf{z} = f^{-1}(\mathbf{x}) = g(\mathbf{x})$:

$$\log p_X(\mathbf{x}) = \log p_Z(g(\mathbf{x})) + \log \left| \det \frac{\partial g}{\partial \mathbf{x}} \right|$$

| Term | Meaning |
| --- | --- |
| $p_Z(g(\mathbf{x}))$ | Likelihood of the latent code under the base distribution |
| $\frac{\partial g}{\partial \mathbf{x}}$ | Jacobian matrix of the inverse transformation |
| $\left\lvert \det J \right\rvert$ | Volume scaling factor: how much the transformation stretches or squishes space |

The Volume Correction

Think of probability mass like water. If a transformation compresses a region of space, the density increases there (more water in a smaller volume). If it expands a region, density decreases. The Jacobian determinant is precisely this volume-change correction factor.
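As a one-dimensional worked example (a standard textbook case, not from the original text): take $z \sim \mathcal{N}(0, 1)$ and $x = f(z) = e^{z}$, so that $g(x) = f^{-1}(x) = \log x$. Then

$$\log p_X(x) = \log p_Z(\log x) + \log\left|\frac{d \log x}{dx}\right| = \log p_Z(\log x) - \log x,$$

which is the log-density of a log-normal distribution. The $1/x$ factor is exactly the volume correction: $e^{z}$ stretches space more and more as $z$ grows, so density there must shrink.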


For a Chain of Flows

When $K$ transformations are composed, the log-likelihood accumulates one log-Jacobian term per layer:

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k^{-1}}{\partial \mathbf{x}_k} \right|$$

where $\mathbf{x}_0 = \mathbf{z}$, $\mathbf{x}_k = f_k(\mathbf{x}_{k-1})$, and $\mathbf{x}_K = \mathbf{x}$.

This is why the transformations must be invertible: we need $g = f^{-1}$ to map data back to the base space and evaluate $p_Z$.

[Interactive figure: Volume Scaling between base space ($\mathbf{z}$) and warped data space ($\mathbf{x}$). Example transform: $f(z_1, z_2) = (z_1,\ z_2 \cdot e^{0.8 z_1 + 0.5})$, with Jacobian determinant $\det J = e^{0.8 z_1 + 0.5}$.]
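As a quick numerical sanity check of the example transform above (my own illustration, not from the article), we can compare the closed-form $\log|\det J| = 0.8 z_1 + 0.5$ against a finite-difference Jacobian:

```python
import numpy as np

def f(z):
    """The example transform from the figure: stretches z2 by exp(0.8*z1 + 0.5)."""
    z1, z2 = z
    return np.array([z1, z2 * np.exp(0.8 * z1 + 0.5)])

def analytic_log_det(z):
    """Closed-form log|det J| for this transform (the Jacobian is triangular)."""
    return 0.8 * z[0] + 0.5

def numerical_log_det(z, eps=1e-6):
    """Finite-difference Jacobian, then log|det| -- a sanity check only."""
    d = len(z)
    J = np.zeros((d, d))
    for i in range(d):
        dz = np.zeros(d)
        dz[i] = eps
        J[:, i] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return np.log(abs(np.linalg.det(J)))

z = np.array([0.3, -1.2])
print(analytic_log_det(z))   # 0.74
print(numerical_log_det(z))  # ~0.74, agrees up to finite-difference error
```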

3. The Two Requirements for a Flow Layer

Every building block in a normalizing flow must satisfy exactly two constraints:

  1. Invertible: Given $\mathbf{x} = f(\mathbf{z})$, we must be able to recover $\mathbf{z} = f^{-1}(\mathbf{x})$ efficiently.
  2. Tractable Jacobian: We must be able to compute $\log|\det J|$ efficiently, ideally in $O(d)$ rather than $O(d^3)$ (the cost of a naive determinant for a $d \times d$ matrix).

These two constraints are what drive all of the architectural innovations in normalizing flows. Different flow families make different design choices to satisfy them cheaply.


4. Planar and Radial Flows (Beginner Building Blocks)

The earliest flows (Rezende & Mohamed, 2015) used simple parameterisations.

Planar Flow

$$f(\mathbf{z}) = \mathbf{z} + \mathbf{u} \cdot h(\mathbf{w}^T \mathbf{z} + b)$$

This pushes the distribution along a single hyperplane defined by $\mathbf{w}$. The Jacobian determinant has a closed form via the matrix determinant lemma:

$$\left| \det \frac{\partial f}{\partial \mathbf{z}} \right| = \left| 1 + \mathbf{u}^T h'(\mathbf{w}^T \mathbf{z} + b)\mathbf{w} \right|$$
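A small NumPy sketch of a single planar-flow step and its log-determinant (my own illustration; it omits the constraint on $\mathbf{u}$ that the original paper imposes to guarantee invertibility):

```python
import numpy as np

def planar_flow(z, u, w, b, h=np.tanh, h_prime=lambda a: 1 - np.tanh(a) ** 2):
    """One planar flow step f(z) = z + u * h(w.z + b) and its log|det J|.

    z, u, w: arrays of shape (d,); b: scalar. Sketch only -- the
    invertibility condition on u is not enforced here.
    """
    a = w @ z + b                                   # pre-activation (scalar)
    x = z + u * h(a)                                # push z along direction u
    # matrix determinant lemma: det(I + u psi^T) = 1 + psi^T u, psi = h'(a) w
    log_det = np.log(np.abs(1.0 + (h_prime(a) * w) @ u))
    return x, log_det

rng = np.random.default_rng(0)
z = rng.standard_normal(2)
x, log_det = planar_flow(z, u=np.array([0.5, -0.3]), w=np.array([1.0, 2.0]), b=0.1)
```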

Radial Flow

$$f(\mathbf{z}) = \mathbf{z} + \beta \frac{\mathbf{z} - \mathbf{z}_0}{\alpha + \|\mathbf{z} - \mathbf{z}_0\|}$$

Contracts or expands the distribution around a reference point $\mathbf{z}_0$.

Limitation

Planar and radial flows are expressive in low dimensions but scale poorly — stacking many layers is required to approximate complex distributions in high dimensions.


5. Affine Coupling Layers (RealNVP)

The affine coupling layer (Dinh et al., 2017 — RealNVP) is the workhorse of practical normalizing flows. It achieves both invertibility and a cheap Jacobian by cleverly splitting the input dimensions.

The Split-and-Transform Trick

Partition $\mathbf{x} \in \mathbb{R}^d$ into two halves: $\mathbf{x} = [\mathbf{x}_{1:m}, \, \mathbf{x}_{m+1:d}]$.

Forward pass (generation direction $\mathbf{z} \to \mathbf{x}$):

$$\mathbf{y}_{1:m} = \mathbf{z}_{1:m}, \qquad \mathbf{y}_{m+1:d} = \mathbf{z}_{m+1:d} \odot \exp(s(\mathbf{z}_{1:m})) + t(\mathbf{z}_{1:m})$$

Inverse pass (inference direction $\mathbf{x} \to \mathbf{z}$):

$$\mathbf{z}_{1:m} = \mathbf{y}_{1:m}, \qquad \mathbf{z}_{m+1:d} = (\mathbf{y}_{m+1:d} - t(\mathbf{y}_{1:m})) \odot \exp(-s(\mathbf{y}_{1:m}))$$

where $s(\cdot)$ and $t(\cdot)$ are arbitrary neural networks (scale and translation functions).

Why the Jacobian Is Cheap

Because $\mathbf{y}_{1:m} = \mathbf{z}_{1:m}$ (the first half passes through unchanged), the Jacobian matrix is lower triangular. The determinant of a triangular matrix is just the product of its diagonal entries:

$$\log \left| \det J \right| = \sum_{j=m+1}^{d} s_j(\mathbf{z}_{1:m})$$

This is $O(d)$: no matrix operations needed. The expressive power comes entirely from $s$ and $t$, which can be deep networks.
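Below is a minimal PyTorch sketch of an affine coupling layer along these lines (my own illustration, not code from the article): `forward` is the generation direction with its $O(d)$ log-determinant, and `inverse` recovers the input exactly.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal RealNVP-style affine coupling layer (illustrative sketch)."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.m = dim // 2
        # a single MLP outputs both s(z_1:m) and t(z_1:m)
        self.net = nn.Sequential(
            nn.Linear(self.m, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.m)),
        )

    def forward(self, z):
        """Generation direction z -> y; returns output and log|det J|."""
        z1, z2 = z[:, :self.m], z[:, self.m:]
        s, t = self.net(z1).chunk(2, dim=-1)
        y2 = z2 * torch.exp(s) + t            # scale and shift the second half
        log_det = s.sum(dim=-1)               # triangular Jacobian: sum of s
        return torch.cat([z1, y2], dim=-1), log_det

    def inverse(self, y):
        """Inference direction x -> z; the exact inverse of forward."""
        y1, y2 = y[:, :self.m], y[:, self.m:]
        s, t = self.net(y1).chunk(2, dim=-1)
        z2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, z2], dim=-1), -s.sum(dim=-1)

# quick invertibility check
layer = AffineCoupling(dim=4)
z = torch.randn(5, 4)
y, _ = layer(z)
z_rec, _ = layer.inverse(y)
print(torch.allclose(z, z_rec, atol=1e-6))    # True
```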

[Interactive figure: Affine Coupling Layer, forward pass z → y. The input z is partitioned into z₁:m and zₘ₊₁:d, small networks compute s(z₁:m) and t(z₁:m) to transform the second half, the first half passes through unchanged, and the running log|det J| is displayed.]


6. Forward vs. Inverse Direction

A critical distinction in flow models is which direction is cheap to compute:

| Model | Generation $\mathbf{z} \to \mathbf{x}$ | Inference / Density $\mathbf{x} \to \mathbf{z}$ |
| --- | --- | --- |
| RealNVP (coupling) | one parallel pass | one parallel pass |
| MAF (autoregressive) | $d$ sequential passes | one parallel pass |
| IAF (inverse autoregressive) | one parallel pass | $d$ sequential passes |

This trade-off is fundamental. Models optimised for fast density evaluation (MAF) are slow at sampling; models optimised for fast sampling (IAF) are slow at density evaluation.



7. Autoregressive Flows

Autoregressive flows (Masked Autoregressive Flow — MAF, Papamakarios et al., 2017) apply the same chain-rule factorisation used in autoregressive generative models, but in the context of a flow.

The Autoregressive Structure

$$x_i = z_i \cdot \exp(\alpha_i(x_{1:i-1})) + \mu_i(x_{1:i-1}), \quad i = 1, \dots, d$$

Each dimension $x_i$ is an affine function of the corresponding latent variable $z_i$, with scale $\alpha_i$ and shift $\mu_i$ conditioned on all previous dimensions.

The Jacobian

Because $x_i$ depends only on $z_i$ and earlier $x$s (not later ones), the Jacobian $\frac{\partial \mathbf{x}}{\partial \mathbf{z}}$ is triangular:

$$\log \left| \det J \right| = \sum_{i=1}^{d} \alpha_i(x_{1:i-1})$$

Masking for Parallelism: MADE

In practice, the autoregressive conditioners $\alpha_i, \mu_i$ are implemented using a single neural network with masked weights (MADE): binary masks that zero out connections from $x_j$ to $x_i$ whenever $j \geq i$. This enforces the autoregressive constraint while allowing a single forward pass to compute all conditioners simultaneously.
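To make the direction trade-off concrete, here is a toy sketch of a single MAF step (my own illustration; for readability it uses one small MLP per dimension rather than a real shared MADE network): the density direction touches every dimension independently, while sampling must proceed in order.

```python
import torch
import torch.nn as nn

class ToyMAF(nn.Module):
    """One masked-autoregressive-flow step (sketch, not a real MADE)."""

    def __init__(self, dim, hidden=32):
        super().__init__()
        self.dim = dim
        # conditioner for dimension i only sees the i-1 previous dimensions
        self.cond = nn.ModuleList([
            nn.Sequential(nn.Linear(max(i, 1), hidden), nn.Tanh(), nn.Linear(hidden, 2))
            for i in range(dim)
        ])

    def _alpha_mu(self, x_prev, i):
        if i == 0:
            x_prev = torch.zeros(x_prev.shape[0], 1)   # no context for x_1
        out = self.cond[i](x_prev)
        return out[:, 0], out[:, 1]                    # alpha_i, mu_i

    def density_direction(self, x):
        """x -> z: every z_i depends only on x, so this parallelises."""
        zs, log_det = [], torch.zeros(x.shape[0])
        for i in range(self.dim):
            a, m = self._alpha_mu(x[:, :i], i)
            zs.append((x[:, i] - m) * torch.exp(-a))
            log_det = log_det - a                      # log|det dz/dx| = -sum alpha_i
        return torch.stack(zs, dim=1), log_det

    def sampling_direction(self, z):
        """z -> x: x_i needs x_{1:i-1}, so sampling is inherently sequential."""
        xs = []
        for i in range(self.dim):
            x_prev = torch.stack(xs, dim=1) if xs else torch.zeros(z.shape[0], 0)
            a, m = self._alpha_mu(x_prev, i)
            xs.append(z[:, i] * torch.exp(a) + m)
        return torch.stack(xs, dim=1)
```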

[Interactive figure: Sequential Sampling. Latent variables z₁…z₆ pass through a MADE (masked autoencoder) whose dependency mask enforces the autoregressive ordering, producing data dimensions x₁…x₆; sampling $z \to x$ costs $\mathcal{O}(d)$ sequential steps.]

Sampling is sequential: each $x_i$ must wait for $x_{1:i-1}$ to be computed before it can be generated.


8. Stacking Flow Layers

A single flow layer has limited capacity. Power comes from composing many layers. The log-likelihood simply accumulates:

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}_0) + \sum_{k=1}^{K} \log \left| \det J_k \right|$$

Between coupling layers, flows typically apply permutations or 1×1 convolutions (Glow, Kingma & Dhariwal 2018) to mix the dimensions that were held fixed by the previous layer — otherwise, the same dimensions would always be left unchanged.
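As a sketch of this composition (my own illustration, reusing the hypothetical AffineCoupling class from Section 5), a stack alternates coupling layers with a fixed dimension-reversing permutation and accumulates the log-determinants:

```python
import torch
import torch.nn as nn

class FlowStack(nn.Module):
    """Coupling layers composed with fixed permutations between them (sketch)."""

    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(n_layers)])
        # reverse the dimensions between layers so the untouched half alternates
        self.perm = torch.arange(dim - 1, -1, -1)

    def forward(self, z):
        x, log_det = z, torch.zeros(z.shape[0])
        for layer in self.layers:
            x, ld = layer(x)
            log_det = log_det + ld      # log-dets of composed maps simply add
            x = x[:, self.perm]         # a permutation has |det| = 1, so no extra term
        return x, log_det
```

Glow replaces the fixed reversal with a learned invertible 1×1 convolution, as described next.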

The Glow Architecture (Image Flows)

Glow extended RealNVP to images with three key additions:

  1. Actnorm — per-channel activation normalisation with data-dependent initialisation, an alternative to batch normalisation that remains stable with very small batches.
  2. Invertible 1×1 convolution — a learned permutation parameterised as an invertible matrix, replacing fixed channel shuffles.
  3. Multi-scale architecture — after every few flow steps, half the channels are factored out directly (split off to a Gaussian), reducing computational cost at deeper layers.

[Interactive figure: Composition Stack. A chain of Glow blocks (Actnorm → 1×1 Conv → Coupling) maps the base distribution z to data space x; each block contributes its own log|det J|, and the total log-likelihood is the base log-density plus the sum of all layer contributions.]

9. Continuous Normalizing Flows

Continuous Normalizing Flows (CNFs, Chen et al., 2018) take the limit of infinitely many infinitesimally thin flow layers. Rather than a discrete sequence of transformations, the latent variable evolves according to an ordinary differential equation (ODE):

$$\frac{d\mathbf{z}(t)}{dt} = f_\theta(\mathbf{z}(t), t), \qquad t \in [0, 1]$$

Starting from $\mathbf{z}(0) \sim p_Z$, we solve this ODE forward in time to reach $\mathbf{x} = \mathbf{z}(1)$.

The Continuous Change-of-Variables Formula

The log-density evolves according to the instantaneous change-of-variables formula (Liouville's theorem):

$$\frac{d \log p(\mathbf{z}(t))}{dt} = -\text{tr}\!\left(\frac{\partial f_\theta}{\partial \mathbf{z}(t)}\right)$$

The Jacobian trace (sum of diagonal entries) replaces the full Jacobian determinant; the trace can in turn be estimated cheaply using the Hutchinson estimator:

$$\text{tr}(J) \approx \boldsymbol{\epsilon}^T J \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$$
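A short PyTorch sketch of the Hutchinson estimator (my own illustration), using vector-Jacobian products from autograd so the full Jacobian is never materialised; the small linear test at the end has a known trace to compare against.

```python
import torch

def hutchinson_trace(f, z, n_samples=1):
    """Stochastic estimate of tr(df/dz) via the Hutchinson estimator (sketch).

    f: callable mapping a (B, d) tensor to a (B, d) tensor (the ODE dynamics).
    Only vector-Jacobian products are used, never the full Jacobian.
    """
    z = z.requires_grad_(True)
    out = f(z)
    trace = torch.zeros(z.shape[0])
    for _ in range(n_samples):
        eps = torch.randn_like(z)                    # Rademacher noise also works
        # eps^T (df/dz): one reverse-mode autograd pass
        vjp = torch.autograd.grad(out, z, grad_outputs=eps, retain_graph=True)[0]
        trace = trace + (vjp * eps).sum(dim=1)
    return trace / n_samples

# usage: compare against the exact trace of a small linear map
W = torch.randn(3, 3)
f = lambda z: z @ W.T                                # Jacobian of f is W everywhere
z = torch.randn(8, 3)
est = hutchinson_trace(f, z, n_samples=500)
print(est.mean().item(), torch.trace(W).item())      # should be close
```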

Flow Matching: Training Without ODE Solvers

A major recent advance (Flow Matching, Lipman et al., 2022; Liu et al., 2022) sidesteps the expensive ODE integration during training entirely. Instead of maximising likelihood, the model is trained to directly regress a target vector field that transports the base distribution to the data distribution:

$$\mathcal{L}_\text{FM}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_1} \left\| f_\theta(\mathbf{x}_t, t) - u_t(\mathbf{x}_t \mid \mathbf{x}_0, \mathbf{x}_1) \right\|^2$$

This is a simple regression loss: no likelihood, no Jacobian, just matching a vector field. Stable Diffusion 3 and many modern image models use flow matching.
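As a rough sketch of what this looks like in practice (my own illustration, using the straight-line interpolation path of conditional flow matching / rectified flow, with base samples $\mathbf{x}_0 \sim \mathcal{N}(0, I)$), one training step is just a mean-squared-error regression:

```python
import torch
import torch.nn as nn

def flow_matching_step(model, data_batch, optimizer):
    """One conditional flow-matching step on the straight-line path (sketch)."""
    x1 = data_batch                                  # data samples
    x0 = torch.randn_like(x1)                        # base samples
    t = torch.rand(x1.shape[0], 1)                   # one time per example
    xt = (1 - t) * x0 + t * x1                       # point on the straight path
    target = x1 - x0                                 # velocity of that path
    pred = model(torch.cat([xt, t], dim=-1))         # f_theta(x_t, t)
    loss = ((pred - target) ** 2).mean()             # the regression objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage on 2-D data
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randn(256, 2) * 0.5 + 2.0              # stand-in "dataset"
loss = flow_matching_step(model, batch, opt)
```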

CNF vs. Discrete Flows

CNFs are more flexible (the ODE can take any path through space) but slower at training and inference due to ODE integration. Discrete flows are faster but constrained to a fixed architecture. Flow matching closes this gap by training CNFs with a simple, stable regression objective.


10. Training Objective

For discrete flows, training maximises the exact log-likelihood over a dataset $\mathcal{D} = \{\mathbf{x}^{(i)}\}$:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_X(\mathbf{x}^{(i)}; \theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log p_Z(g(\mathbf{x}^{(i)})) + \log \left| \det \frac{\partial g}{\partial \mathbf{x}^{(i)}} \right| \right]$$

where $g = f^{-1}$ is the inference direction (data → latent). Everything is differentiable with respect to $\theta$, so standard gradient ascent applies.
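A compact sketch of this objective (my own illustration), assuming a hypothetical `flow.inverse(x)` that returns the latent code and $\log|\det \partial \mathbf{z} / \partial \mathbf{x}|$, with a standard Gaussian base:

```python
import math
import torch

def nll_loss(flow, x):
    """Negative log-likelihood under a flow (sketch).

    Assumes flow.inverse(x) returns (z, log_det) with
    log_det = log|det dz/dx| per example.
    """
    z, log_det = flow.inverse(x)
    # log-density of a standard Gaussian base, summed over dimensions
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)
    return -(log_pz + log_det).mean()

# training loop sketch: minimising NLL = maximising the exact log-likelihood
# optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)
# for x in data_loader:
#     loss = nll_loss(flow, x)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```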

Why Exact Likelihood Matters

| Model | Likelihood | What's optimised |
| --- | --- | --- |
| VAE | Lower bound (ELBO) | $\log p - \text{KL gap}$ |
| GAN | Implicit, not evaluated | Adversarial minimax |
| Diffusion | Weighted lower bound | $\sum_t \lVert \epsilon - \hat{\epsilon} \rVert^2$ |
| Flow | Exact $\log p$ | True log-likelihood |

Exact likelihood enables direct comparison between models (in bits-per-dimension), reliable density estimation, and use as a prior in larger systems.


11. Applications

Density Estimation

Flows directly answer: "How likely is this data point?" This is valuable for anomaly detection (flag samples with very low $\log p$) and scientific applications where calibrated probabilities matter.
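As a tiny sketch of how this is used (my own illustration, assuming a hypothetical `flow.log_prob(x)` that returns the exact log-density), anomalies are simply the points whose log-density falls below a threshold chosen on held-out data:

```python
import torch

def anomaly_flags(flow, x_val, x_test, quantile=0.01):
    """Density-based anomaly detection with a trained flow (sketch)."""
    # threshold = 1st percentile of log-densities on held-out data
    threshold = torch.quantile(flow.log_prob(x_val), quantile)
    return flow.log_prob(x_test) < threshold   # True = flagged as anomalous
```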

Generative Modelling

Sample $\mathbf{z} \sim p_Z$, run the forward pass $\mathbf{x} = f(\mathbf{z})$. Used in image generation (Glow), audio synthesis (WaveGlow), and molecular design.

Variational Inference

Use a flow as the variational posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ in a VAE. The flow makes $q$ more expressive than a diagonal Gaussian without losing tractability, a technique called normalizing flows for variational inference.

Latent Space Manipulation

Because the mapping between $\mathbf{x}$ and $\mathbf{z}$ is exact and invertible, you can encode a real image to $\mathbf{z}$, manipulate $\mathbf{z}$ (interpolate, add attribute vectors), then decode back. Unlike VAE-based approaches, there is no approximation error in the reconstruction.


12. Comparison with Other Generative Models

| Property | Autoregressive | VAE | GAN | Diffusion | Flow |
| --- | --- | --- | --- | --- | --- |
| Exact log-likelihood | Yes | No (ELBO) | No | No (bound) | Yes |
| Fast sampling | No (sequential) | Yes | Yes | No (steps) | Yes |
| Stable training | Yes | Yes | No | Yes | Yes |
| Latent space | Implicit | Approx. | Implicit | Noisy | Exact |
| Memory cost | Low | Low | Low | Medium | High |
| Best known use | Text (GPT) | Representation | Images | Images | Density |

The main cost of flows is memory and compute: storing the full invertible network with its Jacobian terms is expensive. This is why flows have been largely supplanted by diffusion models for image generation, where scalability matters more than exact likelihood.


Summary

| Concept | Key Idea |
| --- | --- |
| Change-of-variables | $\log p_X(\mathbf{x}) = \log p_Z(g(\mathbf{x})) + \log\lvert\det J\rvert$ |
| Coupling layer | Split dimensions; transform half with an unconstrained network; Jacobian is triangular |
| Autoregressive flow | Each dimension conditioned on previous; parallel inference, sequential sampling |
| Continuous flow | ODE dynamics; log-density from Jacobian trace instead of determinant |
| Flow matching | Regress a target vector field; no ODE solve during training |
| Exact likelihood | True $\log p$, not a bound; enables density estimation and model comparison |
