
Normalizing Flows

How invertible neural networks learn exact probability distributions — from the change-of-variables formula to modern coupling layers and continuous flows.


Imagine you have a simple, well-understood probability distribution — a standard Gaussian — and you want to transform it into something far more complex: the distribution of natural images, or the distribution of English sentences. Normalizing flows do exactly this, through a chain of invertible, differentiable transformations.

What makes flows special is their exactness. Unlike VAEs (which optimise a lower bound) or GANs (which use an implicit adversarial objective), normalizing flows give you the exact log-likelihood of any data point. The same model can both generate samples and evaluate exact probabilities, depending on which direction you run it.


1. The Core Intuition

Start with a simple "base" distribution, typically an isotropic Gaussian $p_Z(\mathbf{z})$, and apply a sequence of invertible transformations $f_1, f_2, \dots, f_K$ to produce samples in data space:

$$\mathbf{z} \sim p_Z(\mathbf{z}), \qquad \mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$

Because each $f_k$ is invertible, you can run the chain in reverse: given a data point $\mathbf{x}$, recover the latent code $\mathbf{z}$ exactly.

The word normalizing refers to this reverse direction — mapping complex data back to a "normal" (Gaussian) form. The word flow refers to the sequence of transformations.

[Interactive figure: Flow Transformation. Invertible steps transform a Gaussian base $\mathbf{z} \sim \mathcal{N}(0, I)$ into a target shape; each step maps points deterministically.]

2. The Change-of-Variables Formula

The mathematical engine behind flows is the change-of-variables formula from calculus. If $\mathbf{x} = f(\mathbf{z})$ and $f$ is a differentiable bijection, then the probability densities are related by:

$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \cdot \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right|$$

Or equivalently, writing $\mathbf{z} = f^{-1}(\mathbf{x}) = g(\mathbf{x})$:

$$\log p_X(\mathbf{x}) = \log p_Z(g(\mathbf{x})) + \log \left| \det \frac{\partial g}{\partial \mathbf{x}} \right|$$

| Term | Meaning |
| --- | --- |
| $p_Z(g(\mathbf{x}))$ | Likelihood of the latent code under the base distribution |
| $\frac{\partial g}{\partial \mathbf{x}}$ | Jacobian matrix of the inverse transformation |
| $\left\lvert \det J \right\rvert$ | Volume scaling factor: how much the transformation stretches or squishes space |

The Volume Correction

Think of probability mass like water. If a transformation compresses a region of space, the density increases there (more water in a smaller volume). If it expands a region, density decreases. The Jacobian determinant is precisely this volume-change correction factor.
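As a one-dimensional worked example (a standard textbook case, not from the original text): take $z \sim \mathcal{N}(0, 1)$ and $x = f(z) = e^{z}$, so that $g(x) = f^{-1}(x) = \log x$. Then

$$\log p_X(x) = \log p_Z(\log x) + \log\left|\frac{d \log x}{dx}\right| = \log p_Z(\log x) - \log x,$$

which is the log-density of a log-normal distribution. The $1/x$ factor is exactly the volume correction: $e^{z}$ stretches space more and more as $z$ grows, so density there must shrink.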


For a Chain of Flows

When $K$ transformations are composed, the log-likelihood accumulates one log-Jacobian term per layer:

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) + \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k^{-1}}{\partial \mathbf{x}_k} \right|$$

where $\mathbf{x}_0 = \mathbf{z}$, $\mathbf{x}_k = f_k(\mathbf{x}_{k-1})$, and $\mathbf{x}_K = \mathbf{x}$.

This is why the transformations must be invertible: we need $g = f^{-1}$ to map data back to the base space and evaluate $p_Z$.

[Interactive figure: Volume Scaling between base space ($\mathbf{z}$) and warped data space ($\mathbf{x}$). Example transform: $f(z_1, z_2) = (z_1,\ z_2 \cdot e^{0.8 z_1 + 0.5})$, with Jacobian determinant $\det J = e^{0.8 z_1 + 0.5}$.]
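As a quick numerical sanity check of the example transform above (my own illustration, not from the article), we can compare the closed-form $\log|\det J| = 0.8 z_1 + 0.5$ against a finite-difference Jacobian:

```python
import numpy as np

def f(z):
    """The example transform from the figure: stretches z2 by exp(0.8*z1 + 0.5)."""
    z1, z2 = z
    return np.array([z1, z2 * np.exp(0.8 * z1 + 0.5)])

def analytic_log_det(z):
    """Closed-form log|det J| for this transform (the Jacobian is triangular)."""
    return 0.8 * z[0] + 0.5

def numerical_log_det(z, eps=1e-6):
    """Finite-difference Jacobian, then log|det| -- a sanity check only."""
    d = len(z)
    J = np.zeros((d, d))
    for i in range(d):
        dz = np.zeros(d)
        dz[i] = eps
        J[:, i] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return np.log(abs(np.linalg.det(J)))

z = np.array([0.3, -1.2])
print(analytic_log_det(z))   # 0.74
print(numerical_log_det(z))  # ~0.74, agrees up to finite-difference error
```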

3. The Two Requirements for a Flow Layer

Every building block in a normalizing flow must satisfy exactly two constraints:

  1. Invertible: Given $\mathbf{x} = f(\mathbf{z})$, we must be able to recover $\mathbf{z} = f^{-1}(\mathbf{x})$ efficiently.
  2. Tractable Jacobian: We must be able to compute $\log|\det J|$ efficiently, ideally in $O(d)$ rather than $O(d^3)$ (the cost of a naive determinant for a $d \times d$ matrix).

These two constraints are what drive all of the architectural innovations in normalizing flows. Different flow families make different design choices to satisfy them cheaply.


4. Planar and Radial Flows (Beginner Building Blocks)

The earliest flows (Rezende & Mohamed, 2015) used simple parameterisations.

Planar Flow

$$f(\mathbf{z}) = \mathbf{z} + \mathbf{u} \cdot h(\mathbf{w}^T \mathbf{z} + b)$$

This pushes the distribution along a single hyperplane defined by $\mathbf{w}$. The Jacobian determinant has a closed form via the matrix determinant lemma:

$$\left| \det \frac{\partial f}{\partial \mathbf{z}} \right| = \left| 1 + \mathbf{u}^T h'(\mathbf{w}^T \mathbf{z} + b)\mathbf{w} \right|$$
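A small NumPy sketch of a single planar-flow step and its log-determinant (my own illustration; it omits the constraint on $\mathbf{u}$ that the original paper imposes to guarantee invertibility):

```python
import numpy as np

def planar_flow(z, u, w, b, h=np.tanh, h_prime=lambda a: 1 - np.tanh(a) ** 2):
    """One planar flow step f(z) = z + u * h(w.z + b) and its log|det J|.

    z, u, w: arrays of shape (d,); b: scalar. Sketch only -- the
    invertibility condition on u is not enforced here.
    """
    a = w @ z + b                                   # pre-activation (scalar)
    x = z + u * h(a)                                # push z along direction u
    # matrix determinant lemma: det(I + u psi^T) = 1 + psi^T u, psi = h'(a) w
    log_det = np.log(np.abs(1.0 + (h_prime(a) * w) @ u))
    return x, log_det

rng = np.random.default_rng(0)
z = rng.standard_normal(2)
x, log_det = planar_flow(z, u=np.array([0.5, -0.3]), w=np.array([1.0, 2.0]), b=0.1)
```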

Radial Flow

$$f(\mathbf{z}) = \mathbf{z} + \beta \frac{\mathbf{z} - \mathbf{z}_0}{\alpha + \|\mathbf{z} - \mathbf{z}_0\|}$$

Contracts or expands the distribution around a reference point $\mathbf{z}_0$.

Limitation

Planar and radial flows are expressive in low dimensions but scale poorly — stacking many layers is required to approximate complex distributions in high dimensions.


5. Affine Coupling Layers (RealNVP)

The affine coupling layer (Dinh et al., 2017 — RealNVP) is the workhorse of practical normalizing flows. It achieves both invertibility and a cheap Jacobian by cleverly splitting the input dimensions.

The Split-and-Transform Trick

Partition $\mathbf{x} \in \mathbb{R}^d$ into two halves: $\mathbf{x} = [\mathbf{x}_{1:m}, \, \mathbf{x}_{m+1:d}]$.

Forward pass (generation direction $\mathbf{z} \to \mathbf{x}$):

$$\mathbf{y}_{1:m} = \mathbf{z}_{1:m}, \qquad \mathbf{y}_{m+1:d} = \mathbf{z}_{m+1:d} \odot \exp(s(\mathbf{z}_{1:m})) + t(\mathbf{z}_{1:m})$$

Inverse pass (inference direction $\mathbf{x} \to \mathbf{z}$):

$$\mathbf{z}_{1:m} = \mathbf{y}_{1:m}, \qquad \mathbf{z}_{m+1:d} = (\mathbf{y}_{m+1:d} - t(\mathbf{y}_{1:m})) \odot \exp(-s(\mathbf{y}_{1:m}))$$

where $s(\cdot)$ and $t(\cdot)$ are arbitrary neural networks (scale and translation functions).

Why the Jacobian Is Cheap

Because $\mathbf{y}_{1:m} = \mathbf{z}_{1:m}$ (the first half passes through unchanged), the Jacobian matrix is lower triangular. The determinant of a triangular matrix is just the product of its diagonal entries:

$$\log \left| \det J \right| = \sum_{j=m+1}^{d} s_j(\mathbf{z}_{1:m})$$

This is $O(d)$: no matrix operations needed. The expressive power comes entirely from $s$ and $t$, which can be deep networks.
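Below is a minimal PyTorch sketch of an affine coupling layer along these lines (my own illustration, not code from the article): `forward` is the generation direction with its $O(d)$ log-determinant, and `inverse` recovers the input exactly.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal RealNVP-style affine coupling layer (illustrative sketch)."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.m = dim // 2
        # a single MLP outputs both s(z_1:m) and t(z_1:m)
        self.net = nn.Sequential(
            nn.Linear(self.m, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.m)),
        )

    def forward(self, z):
        """Generation direction z -> y; returns output and log|det J|."""
        z1, z2 = z[:, :self.m], z[:, self.m:]
        s, t = self.net(z1).chunk(2, dim=-1)
        y2 = z2 * torch.exp(s) + t            # scale and shift the second half
        log_det = s.sum(dim=-1)               # triangular Jacobian: sum of s
        return torch.cat([z1, y2], dim=-1), log_det

    def inverse(self, y):
        """Inference direction x -> z; the exact inverse of forward."""
        y1, y2 = y[:, :self.m], y[:, self.m:]
        s, t = self.net(y1).chunk(2, dim=-1)
        z2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, z2], dim=-1), -s.sum(dim=-1)

# quick invertibility check
layer = AffineCoupling(dim=4)
z = torch.randn(5, 4)
y, _ = layer(z)
z_rec, _ = layer.inverse(y)
print(torch.allclose(z, z_rec, atol=1e-6))    # True
```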

[Interactive figure: Affine Coupling Layer, forward pass z → y. The input z is partitioned into z₁:m and zₘ₊₁:d, small networks compute s(z₁:m) and t(z₁:m) to transform the second half, the first half passes through unchanged, and the running log|det J| is displayed.]


6. Forward vs. Inverse Direction

A critical distinction in flow models is which direction is cheap to compute:

| Model | Generation $\mathbf{z} \to \mathbf{x}$ | Inference / Density $\mathbf{x} \to \mathbf{z}$ |
| --- | --- | --- |
| RealNVP (coupling) | one parallel pass | one parallel pass |
| MAF (autoregressive) | $d$ sequential passes | one parallel pass |
| IAF (inverse autoregressive) | one parallel pass | $d$ sequential passes |

This trade-off is fundamental. Models optimised for fast density evaluation (MAF) are slow at sampling; models optimised for fast sampling (IAF) are slow at density evaluation.



7. Autoregressive Flows

Autoregressive flows (Masked Autoregressive Flow — MAF, Papamakarios et al., 2017) apply the same chain-rule factorisation used in autoregressive generative models, but in the context of a flow.

The Autoregressive Structure

$$x_i = z_i \cdot \exp(\alpha_i(x_{1:i-1})) + \mu_i(x_{1:i-1}), \quad i = 1, \dots, d$$

Each dimension $x_i$ is an affine function of the corresponding latent variable $z_i$, with scale $\alpha_i$ and shift $\mu_i$ conditioned on all previous dimensions.

The Jacobian

Because $x_i$ depends only on $z_i$ and earlier $x$s (not later ones), the Jacobian $\frac{\partial \mathbf{x}}{\partial \mathbf{z}}$ is triangular:

$$\log \left| \det J \right| = \sum_{i=1}^{d} \alpha_i(x_{1:i-1})$$

Masking for Parallelism: MADE

In practice, the autoregressive conditioners $\alpha_i, \mu_i$ are implemented using a single neural network with masked weights (MADE): binary masks that zero out connections from $x_j$ to $x_i$ whenever $j \geq i$. This enforces the autoregressive constraint while allowing a single forward pass to compute all conditioners simultaneously.
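To make the direction trade-off concrete, here is a toy sketch of a single MAF step (my own illustration; for readability it uses one small MLP per dimension rather than a real shared MADE network): the density direction touches every dimension independently, while sampling must proceed in order.

```python
import torch
import torch.nn as nn

class ToyMAF(nn.Module):
    """One masked-autoregressive-flow step (sketch, not a real MADE)."""

    def __init__(self, dim, hidden=32):
        super().__init__()
        self.dim = dim
        # conditioner for dimension i only sees the i-1 previous dimensions
        self.cond = nn.ModuleList([
            nn.Sequential(nn.Linear(max(i, 1), hidden), nn.Tanh(), nn.Linear(hidden, 2))
            for i in range(dim)
        ])

    def _alpha_mu(self, x_prev, i):
        if i == 0:
            x_prev = torch.zeros(x_prev.shape[0], 1)   # no context for x_1
        out = self.cond[i](x_prev)
        return out[:, 0], out[:, 1]                    # alpha_i, mu_i

    def density_direction(self, x):
        """x -> z: every z_i depends only on x, so this parallelises."""
        zs, log_det = [], torch.zeros(x.shape[0])
        for i in range(self.dim):
            a, m = self._alpha_mu(x[:, :i], i)
            zs.append((x[:, i] - m) * torch.exp(-a))
            log_det = log_det - a                      # log|det dz/dx| = -sum alpha_i
        return torch.stack(zs, dim=1), log_det

    def sampling_direction(self, z):
        """z -> x: x_i needs x_{1:i-1}, so sampling is inherently sequential."""
        xs = []
        for i in range(self.dim):
            x_prev = torch.stack(xs, dim=1) if xs else torch.zeros(z.shape[0], 0)
            a, m = self._alpha_mu(x_prev, i)
            xs.append(z[:, i] * torch.exp(a) + m)
        return torch.stack(xs, dim=1)
```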

[Interactive figure: Sequential Sampling. Latent variables z₁…z₆ pass through a MADE (masked autoencoder) whose dependency mask enforces the autoregressive ordering, producing data dimensions x₁…x₆; sampling $z \to x$ costs $\mathcal{O}(d)$ sequential steps.]

Sampling is sequential: each $x_i$ must wait for $x_{1:i-1}$ to be computed before it can be generated.


8. Stacking Flow Layers

A single flow layer has limited capacity. Power comes from composing many layers. The log-likelihood simply accumulates:

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}_0) + \sum_{k=1}^{K} \log \left| \det J_k \right|$$

Between coupling layers, flows typically apply permutations or 1×1 convolutions (Glow, Kingma & Dhariwal 2018) to mix the dimensions that were held fixed by the previous layer — otherwise, the same dimensions would always be left unchanged.
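As a sketch of this composition (my own illustration, reusing the hypothetical AffineCoupling class from Section 5), a stack alternates coupling layers with a fixed dimension-reversing permutation and accumulates the log-determinants:

```python
import torch
import torch.nn as nn

class FlowStack(nn.Module):
    """Coupling layers composed with fixed permutations between them (sketch)."""

    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(n_layers)])
        # reverse the dimensions between layers so the untouched half alternates
        self.perm = torch.arange(dim - 1, -1, -1)

    def forward(self, z):
        x, log_det = z, torch.zeros(z.shape[0])
        for layer in self.layers:
            x, ld = layer(x)
            log_det = log_det + ld      # log-dets of composed maps simply add
            x = x[:, self.perm]         # a permutation has |det| = 1, so no extra term
        return x, log_det
```

Glow replaces the fixed reversal with a learned invertible 1×1 convolution, as described next.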

The Glow Architecture (Image Flows)

Glow extended RealNVP to images with three key additions:

  1. Actnorm — per-channel activation normalisation with data-dependent initialisation, an alternative to batch normalisation that remains stable with very small batches.
  2. Invertible 1×1 convolution — a learned permutation parameterised as an invertible matrix, replacing fixed channel shuffles.
  3. Multi-scale architecture — after every few flow steps, half the channels are factored out directly (split off to a Gaussian), reducing computational cost at deeper layers.

[Interactive figure: Composition Stack. A chain of Glow blocks (Actnorm → 1×1 Conv → Coupling) maps the base distribution z to data space x; each block contributes its own log|det J|, and the total log-likelihood is the base log-density plus the sum of all layer contributions.]

9. Continuous Normalizing Flows

Continuous Normalizing Flows (CNFs, Chen et al., 2018) take the limit of infinitely many infinitesimally thin flow layers. Rather than a discrete sequence of transformations, the latent variable evolves according to an ordinary differential equation (ODE):

$$\frac{d\mathbf{z}(t)}{dt} = f_\theta(\mathbf{z}(t), t), \qquad t \in [0, 1]$$

Starting from $\mathbf{z}(0) \sim p_Z$, we solve this ODE forward in time to reach $\mathbf{x} = \mathbf{z}(1)$.

The Continuous Change-of-Variables Formula

The log-density evolves according to the instantaneous change-of-variables formula (Liouville's theorem):

$$\frac{d \log p(\mathbf{z}(t))}{dt} = -\text{tr}\!\left(\frac{\partial f_\theta}{\partial \mathbf{z}(t)}\right)$$

The Jacobian trace (sum of diagonal entries) replaces the full Jacobian determinant; the trace can in turn be estimated cheaply using the Hutchinson estimator:

$$\text{tr}(J) \approx \boldsymbol{\epsilon}^T J \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$$
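A short PyTorch sketch of the Hutchinson estimator (my own illustration), using vector-Jacobian products from autograd so the full Jacobian is never materialised; the small linear test at the end has a known trace to compare against.

```python
import torch

def hutchinson_trace(f, z, n_samples=1):
    """Stochastic estimate of tr(df/dz) via the Hutchinson estimator (sketch).

    f: callable mapping a (B, d) tensor to a (B, d) tensor (the ODE dynamics).
    Only vector-Jacobian products are used, never the full Jacobian.
    """
    z = z.requires_grad_(True)
    out = f(z)
    trace = torch.zeros(z.shape[0])
    for _ in range(n_samples):
        eps = torch.randn_like(z)                    # Rademacher noise also works
        # eps^T (df/dz): one reverse-mode autograd pass
        vjp = torch.autograd.grad(out, z, grad_outputs=eps, retain_graph=True)[0]
        trace = trace + (vjp * eps).sum(dim=1)
    return trace / n_samples

# usage: compare against the exact trace of a small linear map
W = torch.randn(3, 3)
f = lambda z: z @ W.T                                # Jacobian of f is W everywhere
z = torch.randn(8, 3)
est = hutchinson_trace(f, z, n_samples=500)
print(est.mean().item(), torch.trace(W).item())      # should be close
```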

Flow Matching: Training Without ODE Solvers

A major recent advance (Flow Matching, Lipman et al., 2022; Liu et al., 2022) sidesteps the expensive ODE integration during training entirely. Instead of maximising likelihood, the model is trained to directly regress a target vector field that transports the base distribution to the data distribution:

$$\mathcal{L}_\text{FM}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_1} \left\| f_\theta(\mathbf{x}_t, t) - u_t(\mathbf{x}_t \mid \mathbf{x}_0, \mathbf{x}_1) \right\|^2$$

This is a simple regression loss: no likelihood, no Jacobian, just matching a vector field. Stable Diffusion 3 and many modern image models use flow matching.
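As a rough sketch of what this looks like in practice (my own illustration, using the straight-line interpolation path of conditional flow matching / rectified flow, with base samples $\mathbf{x}_0 \sim \mathcal{N}(0, I)$), one training step is just a mean-squared-error regression:

```python
import torch
import torch.nn as nn

def flow_matching_step(model, data_batch, optimizer):
    """One conditional flow-matching step on the straight-line path (sketch)."""
    x1 = data_batch                                  # data samples
    x0 = torch.randn_like(x1)                        # base samples
    t = torch.rand(x1.shape[0], 1)                   # one time per example
    xt = (1 - t) * x0 + t * x1                       # point on the straight path
    target = x1 - x0                                 # velocity of that path
    pred = model(torch.cat([xt, t], dim=-1))         # f_theta(x_t, t)
    loss = ((pred - target) ** 2).mean()             # the regression objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage on 2-D data
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randn(256, 2) * 0.5 + 2.0              # stand-in "dataset"
loss = flow_matching_step(model, batch, opt)
```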

CNF vs. Discrete Flows

CNFs are more flexible (the ODE can take any path through space) but slower at training and inference due to ODE integration. Discrete flows are faster but constrained to a fixed architecture. Flow matching closes this gap by training CNFs with a simple, stable regression objective.


10. Training Objective

For discrete flows, training maximises the exact log-likelihood over a dataset $\mathcal{D} = \{\mathbf{x}^{(i)}\}$:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_X(\mathbf{x}^{(i)}; \theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log p_Z(g(\mathbf{x}^{(i)})) + \log \left| \det \frac{\partial g}{\partial \mathbf{x}^{(i)}} \right| \right]$$

where $g = f^{-1}$ is the inference direction (data → latent). Everything is differentiable with respect to $\theta$, so standard gradient ascent applies.
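A compact sketch of this objective (my own illustration), assuming a hypothetical `flow.inverse(x)` that returns the latent code and $\log|\det \partial \mathbf{z} / \partial \mathbf{x}|$, with a standard Gaussian base:

```python
import math
import torch

def nll_loss(flow, x):
    """Negative log-likelihood under a flow (sketch).

    Assumes flow.inverse(x) returns (z, log_det) with
    log_det = log|det dz/dx| per example.
    """
    z, log_det = flow.inverse(x)
    # log-density of a standard Gaussian base, summed over dimensions
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)
    return -(log_pz + log_det).mean()

# training loop sketch: minimising NLL = maximising the exact log-likelihood
# optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)
# for x in data_loader:
#     loss = nll_loss(flow, x)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```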

Why Exact Likelihood Matters

| Model | Likelihood | What's optimised |
| --- | --- | --- |
| VAE | Lower bound (ELBO) | $\log p - \text{KL gap}$ |
| GAN | Implicit, not evaluated | Adversarial minimax |
| Diffusion | Weighted lower bound | $\sum_t \lVert \epsilon - \hat{\epsilon} \rVert^2$ |
| Flow | Exact $\log p$ | True log-likelihood |

Exact likelihood enables direct comparison between models (in bits-per-dimension), reliable density estimation, and use as a prior in larger systems.


11. Applications

Density Estimation

Flows directly answer: "How likely is this data point?" This is valuable for anomaly detection (flag samples with very low $\log p$) and scientific applications where calibrated probabilities matter.
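As a tiny sketch of how this is used (my own illustration, assuming a hypothetical `flow.log_prob(x)` that returns the exact log-density), anomalies are simply the points whose log-density falls below a threshold chosen on held-out data:

```python
import torch

def anomaly_flags(flow, x_val, x_test, quantile=0.01):
    """Density-based anomaly detection with a trained flow (sketch)."""
    # threshold = 1st percentile of log-densities on held-out data
    threshold = torch.quantile(flow.log_prob(x_val), quantile)
    return flow.log_prob(x_test) < threshold   # True = flagged as anomalous
```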

Generative Modelling

Sample $\mathbf{z} \sim p_Z$, run the forward pass $\mathbf{x} = f(\mathbf{z})$. Used in image generation (Glow), audio synthesis (WaveGlow), and molecular design.

Variational Inference

Use a flow as the variational posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ in a VAE. The flow makes $q$ more expressive than a diagonal Gaussian without losing tractability, a technique called normalizing flows for variational inference.

Latent Space Manipulation

Because the mapping between $\mathbf{x}$ and $\mathbf{z}$ is exact and invertible, you can encode a real image to $\mathbf{z}$, manipulate $\mathbf{z}$ (interpolate, add attribute vectors), then decode back. Unlike VAE-based approaches, there is no approximation error in the reconstruction.


12. Comparison with Other Generative Models

| Property | Autoregressive | VAE | GAN | Diffusion | Flow |
| --- | --- | --- | --- | --- | --- |
| Exact log-likelihood | Yes | No (ELBO) | No | No (bound) | Yes |
| Fast sampling | No (sequential) | Yes | Yes | No (steps) | Yes |
| Stable training | Yes | Yes | No | Yes | Yes |
| Latent space | Implicit | Approx. | Implicit | Noisy | Exact |
| Memory cost | Low | Low | Low | Medium | High |
| Best known use | Text (GPT) | Representation | Images | Images | Density |

The main cost of flows is memory and compute: storing the full invertible network with its Jacobian terms is expensive. This is why flows have been largely supplanted by diffusion models for image generation, where scalability matters more than exact likelihood.


Summary

| Concept | Key Idea |
| --- | --- |
| Change-of-variables | $\log p_X(\mathbf{x}) = \log p_Z(g(\mathbf{x})) + \log\lvert\det J\rvert$ |
| Coupling layer | Split dimensions; transform half with an unconstrained network; Jacobian is triangular |
| Autoregressive flow | Each dimension conditioned on previous; parallel inference, sequential sampling |
| Continuous flow | ODE dynamics; log-density from Jacobian trace instead of determinant |
| Flow matching | Regress a target vector field; no ODE solve during training |
| Exact likelihood | True $\log p$, not a bound; enables density estimation and model comparison |
