
Generative Adversarial Networks

How two neural networks locked in competition learn to generate indistinguishably realistic data — from the minimax game to Wasserstein distance and StyleGAN.


In 2014, Ian Goodfellow proposed an entirely different way to train a generative model — not by maximising a likelihood, but by staging a competition.

A Generative Adversarial Network (GAN) pits two neural networks against each other in a zero-sum game. One network, the Generator, tries to fabricate convincing fake data. The other, the Discriminator, tries to tell fakes from real samples. As they compete, the generator gets better at fooling the discriminator, and the discriminator gets better at detecting fakes — until, at the theoretical optimum, the generator produces data indistinguishable from reality.

The result: state-of-the-art image synthesis, the first photorealistic face generators, and a template for adversarial training that permeates modern deep learning.


1. The Counterfeiter and the Detective

The easiest way to understand GANs is through an analogy.

Imagine a counterfeiter who prints fake banknotes. A detective examines notes and tries to identify fakes. Each time the detective catches a forgery, the counterfeiter studies what gave them away and improves the next batch. Each time the counterfeiter fools the detective, the detective studies the successful fake and sharpens their eye.

This feedback loop drives both parties toward a higher standard:

  • The counterfeiter (Generator $G$) learns to produce increasingly realistic fakes.
  • The detective (Discriminator $D$) learns increasingly subtle distinctions.

The game ends when the counterfeiter's fakes are so good that the detective can do no better than flipping a coin — they literally cannot tell real from fake.

[Interactive demo: the generator and discriminator losses, and the generator's success rate, evolving as the two networks compete]

2. The Two Networks

Generator $G$

The generator is a neural network that maps a random latent vector $\mathbf{z}$ drawn from a simple prior (usually $\mathcal{N}(\mathbf{0}, \mathbf{I})$) to a data sample:

$$G_\theta : \mathbb{R}^{d_z} \to \mathbb{R}^{d_x}, \qquad \tilde{\mathbf{x}} = G_\theta(\mathbf{z}), \quad \mathbf{z} \sim p_z(\mathbf{z})$$

The generator never sees real data directly. It only receives gradient signals from the discriminator telling it how to improve its fakes. Architecturally it is often a stack of transposed convolutions (for images) that upsample from the latent code to full resolution.

Discriminator $D$

The discriminator is a classifier that takes a data sample (real or generated) and outputs a scalar probability:

$$D_\phi : \mathbb{R}^{d_x} \to [0, 1], \qquad D_\phi(\mathbf{x}) \approx P(\mathbf{x} \text{ is real})$$

Architecturally it mirrors the generator in reverse — a stack of strided convolutions that compress an image down to a single scalar.

| | Generator $G$ | Discriminator $D$ |
| --- | --- | --- |
| Input | Random noise $\mathbf{z}$ | Data sample $\mathbf{x}$ (real or fake) |
| Output | Fake sample $\tilde{\mathbf{x}}$ | Scalar probability $\in [0, 1]$ |
| Goal | Fool $D$ into outputting 1 | Output 1 for real, 0 for fake |
| Architecture | Upsampling (transposed conv) | Downsampling (strided conv) |
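The division of labour can be sketched in a few lines of NumPy — untrained toy MLPs (the sizes $d_z = 8$, $d_x = 64$ are arbitrary) standing in for the real convolutional stacks:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 8, 64   # latent and data dimensionality (toy sizes)

# Generator: R^{d_z} -> R^{d_x}. A tiny MLP stands in for the
# transposed-conv stack used for images.
W1g = rng.normal(size=(d_z, 32)) * 0.1
W2g = rng.normal(size=(32, d_x)) * 0.1
def G(z):
    h = np.maximum(z @ W1g, 0.0)        # ReLU hidden layer
    return h @ W2g                      # fake sample x_tilde

# Discriminator: R^{d_x} -> [0, 1], mirroring G in reverse.
W1d = rng.normal(size=(d_x, 32)) * 0.1
W2d = rng.normal(size=(32, 1)) * 0.1
def D(x):
    h = np.maximum(x @ W1d, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ W2d)))   # sigmoid ~ P(x is real)

z = rng.normal(size=(16, d_z))          # z ~ N(0, I)
x_fake = G(z)
p_real = D(x_fake)
print(x_fake.shape, p_real.shape)       # (16, 64) (16, 1)
```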

3. The Minimax Objective

The formal objective of GAN training is the following minimax game:

$$\min_\theta \max_\phi \; \mathcal{V}(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_\text{data}} [\log D_\phi(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z} [\log (1 - D_\phi(G_\theta(\mathbf{z})))]$$

Breaking this down:

The discriminator ($\max_\phi$) wants to:

  • Make $D_\phi(\mathbf{x}) \to 1$ for real data — maximise $\log D_\phi(\mathbf{x})$.
  • Make $D_\phi(G_\theta(\mathbf{z})) \to 0$ for fake data — maximise $\log(1 - D_\phi(G_\theta(\mathbf{z})))$.

The generator ($\min_\theta$) wants to:

  • Make $D_\phi(G_\theta(\mathbf{z})) \to 1$ so the discriminator is fooled — minimise $\log(1 - D_\phi(G_\theta(\mathbf{z})))$.

Binary Cross-Entropy Connection

$\mathcal{V}$ is exactly the negative binary cross-entropy of a classifier that labels real samples as 1 and fake samples as 0. The discriminator is just a binary classifier trained with BCE loss — the adversarial twist is that the "fake" examples are generated by a competing network rather than being fixed.
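This equivalence is easy to check numerically. The sketch below fabricates discriminator outputs and confirms that mean BCE over an equal-sized real/fake batch equals $-\mathcal{V}/2$ (the factor of 2 appears because $\mathcal{V}$ sums two per-batch means while BCE averages over all $2m$ samples):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 1000
# Fabricated discriminator outputs: confident-ish on real, doubtful on fake.
D_real = rng.uniform(0.5, 0.99, m)
D_fake = rng.uniform(0.01, 0.5, m)

# The GAN value function V (which D maximises):
V = np.mean(np.log(D_real)) + np.mean(np.log(1 - D_fake))

# Standard binary cross-entropy of the same classifier,
# with labels 1 for real and 0 for fake.
def bce(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

preds = np.concatenate([D_real, D_fake])
labels = np.concatenate([np.ones(m), np.zeros(m)])

# V sums two per-batch means; mean BCE averages over 2m samples,
# so the two agree up to a factor of -1/2.
assert np.isclose(bce(preds, labels).mean(), -V / 2)
```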

[Interactive 1-D minimax visualization: $p_\text{data}$, $p_g$, and the discriminator output $D(x)$]

4. The Training Loop

GANs are trained by alternating gradient updates — not jointly minimising a single loss.

One Training Iteration

Step 1 — Update the Discriminator (fix $G$, update $D$):

Sample a minibatch of real data $\{\mathbf{x}^{(i)}\}$ and a minibatch of fake data $\{G_\theta(\mathbf{z}^{(i)})\}$.

$$\mathcal{L}_D = -\frac{1}{m} \sum_{i=1}^{m} \left[ \log D_\phi(\mathbf{x}^{(i)}) + \log(1 - D_\phi(G_\theta(\mathbf{z}^{(i)}))) \right]$$

Perform gradient descent on $\mathcal{L}_D$ with respect to $\phi$. In practice $D$ is updated $k$ times per $G$ update (often $k = 1$, or $k = 5$ in Wasserstein GAN).

Step 2 — Update the Generator (fix $D$, update $G$):

Sample a new minibatch of latent codes $\{\mathbf{z}^{(i)}\}$.

$$\mathcal{L}_G = -\frac{1}{m} \sum_{i=1}^{m} \log D_\phi(G_\theta(\mathbf{z}^{(i)}))$$

The Non-Saturating Generator Loss

The original minimax loss for $G$ minimises $\log(1 - D(G(\mathbf{z})))$. In practice this saturates early in training — when $D$ confidently rejects fakes, $\log(1 - D(G(\mathbf{z}))) \approx 0$ and gradients vanish. The standard fix is to instead maximise $\log D(G(\mathbf{z}))$, which provides stronger gradients when the generator is weak.
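The saturation is easy to see by differentiating both losses with respect to the discriminator's logit $\ell$, where $D = \sigma(\ell)$ — a small NumPy check:

```python
import numpy as np

# D's logit on a fake sample; very negative = confidently rejected.
logits = np.array([-8.0, -4.0, 0.0])
D_fake = 1 / (1 + np.exp(-logits))

# Minimax loss log(1 - D): gradient w.r.t. the logit is -D,
# which vanishes exactly when fakes are confidently rejected.
grad_saturating = -D_fake

# Non-saturating loss -log D: gradient w.r.t. the logit is -(1 - D),
# which stays near -1 in that same regime.
grad_nonsat = -(1 - D_fake)

print(grad_saturating)   # magnitudes ~ 0.0003, 0.018, 0.5
print(grad_nonsat)       # magnitudes ~ 0.9997, 0.982, 0.5
```

The confidently rejected fake (logit $-8$) gets almost no learning signal under the minimax loss but nearly full signal under the non-saturating one.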

[Interactive diagram of one training iteration: sample real data, sample latent noise, update the discriminator with $G$'s weights frozen, then update the generator with $D$'s weights frozen]

5. The Optimal Discriminator and Global Minimum

What does the minimax game converge to? Goodfellow et al. proved two key results.

Optimal Discriminator

For a fixed generator defining distribution $p_g$, the optimal discriminator is:

$$D^*(\mathbf{x}) = \frac{p_\text{data}(\mathbf{x})}{p_\text{data}(\mathbf{x}) + p_g(\mathbf{x})}$$

When $p_g = p_\text{data}$, this gives $D^*(\mathbf{x}) = \frac{1}{2}$ everywhere — the discriminator can do no better than chance, meaning the generator has won.

The Global Minimum

Substituting $D^*$ into the objective, the minimax value becomes:

$$\mathcal{V}(G, D^*) = -\log 4 + 2 \cdot D_\text{JS}(p_\text{data} \| p_g)$$

where $D_\text{JS}$ is the Jensen–Shannon divergence — a symmetric, bounded version of KL divergence:

$$D_\text{JS}(p \| q) = \frac{1}{2} D_\text{KL}(p \| m) + \frac{1}{2} D_\text{KL}(q \| m), \quad m = \frac{p + q}{2}$$

The global minimum is achieved when $p_g = p_\text{data}$, giving $D_\text{JS} = 0$ and $\mathcal{V} = -\log 4$.
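Both results can be checked on discrete distributions, where every expectation becomes a finite sum:

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(10); p /= p.sum()   # "p_data" over 10 discrete states
q = rng.random(10); q /= q.sum()   # "p_g"

D_star = p / (p + q)               # optimal discriminator
V = np.sum(p * np.log(D_star)) + np.sum(q * np.log(1 - D_star))

m_mix = (p + q) / 2
kl = lambda a, b: np.sum(a * np.log(a / b))
js = 0.5 * kl(p, m_mix) + 0.5 * kl(q, m_mix)

# V at the optimal discriminator equals -log 4 + 2 * JS(p_data || p_g).
assert np.isclose(V, -np.log(4) + 2 * js)

# And when p_g = p_data, D* is 1/2 everywhere (so V = -log 4).
assert np.allclose(p / (p + p), 0.5)
```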

[Interactive visualization: the optimal discriminator $D^*(x)$ computed from $p_\text{data}$ and $p_g$]


6. Training Instabilities

Despite the elegance of the theory, training GANs in practice is notoriously difficult. The root cause: the JS divergence used by the vanilla GAN is a problematic training signal.

The Vanishing Gradient Problem

When $p_\text{data}$ and $p_g$ have disjoint supports (which is almost always true early in training for high-dimensional data), the JS divergence is constant — it equals $\log 2$ regardless of how far apart the distributions are.

$$D_\text{JS}(p_\text{data} \| p_g) = \log 2 \quad \text{when supports don't overlap}$$

This means the generator's loss gradient is zero — training stalls completely.
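A discrete toy example makes the constancy visible — two histograms with disjoint support give $D_\text{JS} = \log 2$ no matter how far apart they sit:

```python
import numpy as np

def js_divergence(p, q):
    m = (p + q) / 2
    def kl(a, b):
        mask = a > 0                     # convention: 0 * log 0 = 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

bins = 100
p = np.zeros(bins); p[0:5] = 0.2         # p_data lives on bins 0-4

js_at_shift = {}
for shift in (10, 40, 90):
    q = np.zeros(bins); q[shift:shift + 5] = 0.2   # p_g, disjoint support
    js_at_shift[shift] = js_divergence(p, q)
print(js_at_shift)    # log 2 ~ 0.693 at every shift
```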

Mode Collapse

Another failure: the generator discovers it can fool the discriminator by producing only a single highly realistic output (or a few), ignoring the diversity of the training data.

For example, when trained on MNIST, the generator might produce only "8"s. The discriminator eventually learns to reject them, the generator switches to "3"s, and the cycle repeats — hopping between modes rather than covering all of them.

| Symptom | Cause |
| --- | --- |
| Generated samples all look identical | Generator exploiting a single weakness in D |
| Training loss oscillates without converging | G and D chasing each other in circles |
| Sudden quality drops mid-training | G switches mode after D adapts |

[Interactive demo: mode collapse on an 8-Gaussian mixture target — a healthy generator spreads its samples across all modes]


7. Wasserstein GAN: A Better Distance

Wasserstein GAN (Arjovsky et al., 2017) fixes the vanishing gradient problem by replacing JS divergence with the Wasserstein-1 distance (also called the Earth Mover's Distance, EMD).

The Earth Mover's Distance

Intuitively: imagine pdatap_\text{data} and pgp_g as piles of earth. The EMD is the minimum total work (mass × distance) needed to reshape one pile into the other.

$$W(p_\text{data}, p_g) = \inf_{\gamma \in \Pi(p_\text{data}, p_g)} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \gamma} [\|\mathbf{x} - \mathbf{y}\|]$$

Unlike JS divergence, the Wasserstein distance:

  • Is always continuous as a function of the generator's parameters.
  • Provides meaningful gradients even when the supports are disjoint.
  • Correlates with sample quality — a lower $W$ distance generally means visually better samples.
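The contrast with JS divergence is easy to see in one dimension, where $W_1$ has a closed form as the integral of the absolute CDF difference; for two disjoint histograms a fixed distance apart, $W_1$ equals that distance:

```python
import numpy as np

def wasserstein1(p, q):
    # 1-D closed form: integrate |CDF_p - CDF_q| over the support
    # (bin width 1 here).
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q)))

bins = 100
p = np.zeros(bins); p[0:5] = 0.2         # p_data concentrated on bins 0-4
w_at_shift = {}
for shift in (10, 40, 90):
    q = np.zeros(bins); q[shift:shift + 5] = 0.2   # p_g, shifted copy
    w_at_shift[shift] = wasserstein1(p, q)
print(w_at_shift)    # 10.0, 40.0, 90.0 — grows linearly with the gap
```

Unlike the flat JS signal, $W_1$ keeps telling the generator how far it still has to move its mass.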

The Dual Formulation

Computing the infimum directly is intractable. By the Kantorovich-Rubinstein duality, it is equivalent to:

$$W(p_\text{data}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{\mathbf{x} \sim p_\text{data}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{z} \sim p_z}[f(G(\mathbf{z}))]$$

where the supremum is over all 1-Lipschitz functions $f$. The discriminator (now called the critic) is trained to approximate this optimal $f$.

[Interactive comparison: as the distributions move apart, the vanilla JS loss flattens out and training stalls, while the Wasserstein distance keeps a linear slope that gives a clear direction even at large distances]

8. Enforcing the Lipschitz Constraint

The critic must be 1-Lipschitz. The original WGAN enforced this by weight clipping — clamping all critic weights to $[-c, c]$ after each update. This works but is crude.

Gradient Penalty (WGAN-GP)

WGAN-GP (Gulrajani et al., 2017) enforces the Lipschitz constraint more elegantly by adding a penalty on the gradient norm of the critic, evaluated at interpolated points between real and fake samples:

$$\hat{\mathbf{x}} = \epsilon \mathbf{x} + (1-\epsilon) G(\mathbf{z}), \quad \epsilon \sim \text{Uniform}(0, 1)$$

$$\mathcal{L}_\text{critic} = \underbrace{\mathbb{E}_{\tilde{\mathbf{x}}}[f(\tilde{\mathbf{x}})] - \mathbb{E}_{\mathbf{x}}[f(\mathbf{x})]}_{\text{Wasserstein estimate}} + \lambda \underbrace{\mathbb{E}_{\hat{\mathbf{x}}}\left[(\|\nabla_{\hat{\mathbf{x}}} f(\hat{\mathbf{x}})\|_2 - 1)^2\right]}_{\text{Gradient penalty}}$$

The penalty forces $\|\nabla f\| \approx 1$ along the straight-line paths between real and fake samples, which is exactly where the constraint matters most.
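The interpolation step is simple to sketch. The toy below uses a linear critic $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$, whose gradient is $\mathbf{w}$ everywhere, so the penalty can be evaluated without autodiff (real implementations differentiate through the critic network instead):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 64, 16
x_real = rng.normal(loc=+1.0, size=(m, d))
x_fake = rng.normal(loc=-1.0, size=(m, d))

# One epsilon per sample: each x_hat lies on the segment between a pair.
eps = rng.uniform(0.0, 1.0, size=(m, 1))
x_hat = eps * x_real + (1 - eps) * x_fake

# Toy linear critic f(x) = w . x, whose gradient is w at every x_hat,
# so the penalty has a closed form.
w = rng.normal(size=d)
grad_norms = np.full(m, np.linalg.norm(w))   # ||grad f(x_hat)|| per sample
penalty = np.mean((grad_norms - 1.0) ** 2)
print(penalty)                               # > 0 for a generic w

# Rescaling the critic to unit gradient norm drives the penalty to zero.
w_unit = w / np.linalg.norm(w)
assert np.isclose((np.linalg.norm(w_unit) - 1.0) ** 2, 0.0)
```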

[Interactive visualization: the critic's gradient $\nabla f(\hat{x})$ along interpolation paths between real and fake samples]
Without the penalty, the critic can become arbitrarily steep or flat, causing vanishing or exploding gradients.


9. Conditional GANs

The vanilla GAN generates samples from the full data distribution with no control over the output. Conditional GANs (cGAN, Mirza & Osindero, 2014) add a conditioning signal $\mathbf{c}$ to both networks:

$$G_\theta(\mathbf{z}, \mathbf{c}): \quad \tilde{\mathbf{x}} = G(\mathbf{z}, \mathbf{c}) \qquad\qquad D_\phi(\mathbf{x}, \mathbf{c}): \quad \text{real/fake given } \mathbf{c}$$

The conditioning signal $\mathbf{c}$ can be:

  • A class label (generate a cat vs. a dog)
  • A text description (text-to-image)
  • Another image (image-to-image translation, Pix2Pix)
  • A pose or landmark map (human pose synthesis)

The minimax objective simply conditions every expectation on $\mathbf{c}$:

$$\min_G \max_D \; \mathbb{E}_{\mathbf{x},\mathbf{c}}[\log D(\mathbf{x}, \mathbf{c})] + \mathbb{E}_{\mathbf{z},\mathbf{c}}[\log(1 - D(G(\mathbf{z}, \mathbf{c}), \mathbf{c}))]$$
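The simplest way to inject the conditioning — used in the original cGAN — is to concatenate $\mathbf{c}$ to the inputs of both networks. A minimal sketch of the generator side:

```python
import numpy as np

rng = np.random.default_rng(4)
d_z, n_classes = 8, 3
z = rng.normal(size=(5, d_z))          # latent noise for 5 samples

labels = np.array([0, 2, 1, 0, 2])
c = np.eye(n_classes)[labels]          # one-hot conditioning signal

# Simplest conditioning: concatenate c to the generator's input
# (and, symmetrically, pair c with x at the discriminator's input).
g_input = np.concatenate([z, c], axis=1)
print(g_input.shape)                   # (5, 11)
```

Richer conditioning schemes (embedding layers, projection discriminators, AdaIN-style modulation) follow the same idea: the label must reach both players so the discriminator can punish label-inconsistent fakes.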

[Interactive demo: conditional generation — latent noise $\mathbf{z}$ combined with a one-hot class label $\mathbf{c}$ yields $G(\mathbf{z}, \mathbf{c})$]

10. Architectural Innovations

DCGAN (2015)

Deep Convolutional GAN established the architectural conventions that made GANs practical for images:

  • Generator: series of transposed convolutions (fractionally-strided conv) with batch norm and ReLU.
  • Discriminator: series of strided convolutions with batch norm and LeakyReLU.
  • No fully-connected layers except at the input/output.
  • Batch normalisation in both networks (except final layer of G and first layer of D).

Progressive Growing of GANs (2018)

Training high-resolution GANs directly is unstable — the discriminator can easily distinguish coarse structural mistakes. ProGAN (Karras et al., 2018) solves this by growing both networks incrementally: start at 4×4, then add a layer to reach 8×8, then 16×16, up to 1024×1024. Each new layer is faded in gradually and trained until stable before the next is added.

StyleGAN (2019) and StyleGAN2 (2020)

StyleGAN (Karras et al., 2019) introduced the most influential GAN architecture to date:

  1. Mapping network $f$: maps the latent code $\mathbf{z} \in \mathcal{Z}$ to an intermediate latent space $\mathbf{w} \in \mathcal{W}$ via 8 fully-connected layers. $\mathcal{W}$ is less entangled than $\mathcal{Z}$.

  2. Adaptive Instance Normalisation (AdaIN): instead of feeding $\mathbf{w}$ directly as input, $\mathbf{w}$ is used to modulate the style (scale and bias) of each layer's feature maps:

$$\text{AdaIN}(\mathbf{h}_i, \mathbf{w}) = \mathbf{y}_{s,i} \frac{\mathbf{h}_i - \mu(\mathbf{h}_i)}{\sigma(\mathbf{h}_i)} + \mathbf{y}_{b,i}$$

  3. Stochastic variation: per-pixel Gaussian noise is added at each layer to model fine-grained stochastic details (individual hair strands, skin pores) separately from global style.

The $\mathcal{W}$ space offers remarkable properties: different levels of $\mathbf{w}$ control different levels of style — coarse structure (pose, face shape) at early layers, fine details (colour, texture) at later layers — enabling style mixing by swapping $\mathbf{w}$ codes at different layer groups.
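AdaIN itself is only a few lines. A NumPy sketch with channel-first $(C, H, W)$ feature maps — in StyleGAN the style parameters $\mathbf{y}_s$, $\mathbf{y}_b$ come from a learned affine map of $\mathbf{w}$; here they are just fixed values:

```python
import numpy as np

def adain(h, y_s, y_b, eps=1e-5):
    # h: (C, H, W) feature maps; y_s, y_b: (C,) style scale and bias.
    # Normalise each channel over its spatial extent, then re-scale
    # and re-shift with the style parameters.
    mu = h.mean(axis=(1, 2), keepdims=True)
    sigma = h.std(axis=(1, 2), keepdims=True)
    return y_s[:, None, None] * (h - mu) / (sigma + eps) + y_b[:, None, None]

rng = np.random.default_rng(5)
h = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8))
out = adain(h, y_s=np.full(4, 1.5), y_b=np.full(4, 0.5))

# Each output channel now carries the style's statistics:
# mean ~ 0.5 and std ~ 1.5, regardless of the input's statistics.
print(out.mean(axis=(1, 2)), out.std(axis=(1, 2)))
```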


11. The Latent Space

Unlike VAEs, a GAN's latent space has no explicit structure imposed by a loss term. Yet in practice it develops meaningful geometry — a result of the generator learning a smooth mapping to avoid easily-discriminated discontinuities.

Interpolation

Straight-line interpolation between two latent codes $\mathbf{z}_A$ and $\mathbf{z}_B$:

$$\mathbf{z}(\lambda) = (1-\lambda)\mathbf{z}_A + \lambda\mathbf{z}_B, \quad \lambda \in [0,1]$$

In $\mathcal{Z}$ space (raw Gaussian), linear paths pass through the high-density centre and can cause abrupt changes. In StyleGAN's $\mathcal{W}$ space, paths are more perceptually linear — intermediate $\lambda$ values produce plausible intermediate faces rather than blends.
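A sketch of the interpolation, including the norm shrinkage that makes straight lines in $\mathcal{Z}$ problematic in high dimensions (the dimensionality 512 matches StyleGAN's latent size, but any large $d$ shows the effect):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 512
w_a, w_b = rng.normal(size=d), rng.normal(size=d)

# Straight-line path between two latent codes.
lambdas = np.linspace(0.0, 1.0, 7)
path = np.array([(1 - lam) * w_a + lam * w_b for lam in lambdas])

# In Z space, the midpoint of two typical Gaussian samples has a
# noticeably smaller norm than its endpoints: the straight line cuts
# through the centre of the prior rather than staying on its
# high-probability shell (norm ~ sqrt(d)).
print(np.linalg.norm(path[3]), np.linalg.norm(w_a), np.linalg.norm(w_b))
```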

Disentanglement

In StyleGAN's $\mathcal{W}$ space, individual directions often correspond to semantic attributes (age, gender, smile, lighting). Finding these directions — via methods like GANSpace or SeFa — enables controlled editing of real images without retraining.

[Interactive demo: traversal $\mathbf{w}(\lambda) = (1-\lambda)\mathbf{w}_A + \lambda\mathbf{w}_B$ between two points in StyleGAN's $\mathcal{W}$ space]


12. Evaluating GANs

GANs have no likelihood — you cannot compare models by log-probability. Evaluation requires dedicated metrics.

Fréchet Inception Distance (FID)

The most widely used metric. Both real and generated samples are passed through an Inception-v3 network and the feature activations (at the last pooling layer) are collected. FID fits a multivariate Gaussian to each set and computes the Fréchet distance between them:

$$\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 + \text{tr}\!\left(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g)^{1/2}\right)$$

| FID value | Interpretation |
| --- | --- |
| 0 | Generated distribution identical to real (theoretical ideal) |
| < 10 | Excellent — near-photorealistic |
| 10–50 | Good — clearly recognisable |
| > 100 | Poor — visible artefacts or mode collapse |

Lower FID is better.
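A simplified FID can be computed in plain NumPy by assuming diagonal covariances, which turns the matrix square root into an element-wise one (real FID uses the full covariance of Inception features; the Gaussian "features" here are synthetic stand-ins):

```python
import numpy as np

def fid_diagonal(feats_r, feats_g):
    # FID with diagonal covariances: the matrix square root of
    # (Sigma_r Sigma_g) reduces to an element-wise sqrt of variances.
    mu_r, mu_g = feats_r.mean(0), feats_g.mean(0)
    var_r, var_g = feats_r.var(0), feats_g.var(0)
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(var_r + var_g - 2 * np.sqrt(var_r * var_g))
    return mean_term + cov_term

rng = np.random.default_rng(7)
feats_real = rng.normal(0.0, 1.0, size=(5000, 64))    # stand-in "features"
feats_close = rng.normal(0.1, 1.0, size=(5000, 64))   # nearly matching generator
feats_far = rng.normal(2.0, 1.5, size=(5000, 64))     # badly mismatched generator

fid_close = fid_diagonal(feats_real, feats_close)
fid_far = fid_diagonal(feats_real, feats_far)
print(fid_close, fid_far)   # the mismatched generator scores far higher
```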

Inception Score (IS)

$$\text{IS} = \exp\!\left(\mathbb{E}_{\mathbf{x} \sim p_g}\left[D_\text{KL}(p(y|\mathbf{x}) \| p(y))\right]\right)$$

Measures two things simultaneously: sharpness (each image is classified confidently into one class) and diversity (the marginal class distribution $p(y)$ is broad). Higher IS is better.

IS has fallen out of favour because it correlates poorly with visual quality and can be gamed.
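Given a matrix of class posteriors $p(y|\mathbf{x})$, the score is a one-liner; the sketch below shows how it rewards sharp, diverse outputs over sharp, collapsed ones (the fabricated posteriors use 4 classes):

```python
import numpy as np

def inception_score(p_y_given_x):
    # p_y_given_x: (N, K) class posteriors p(y|x) for N generated images.
    p_y = p_y_given_x.mean(axis=0)                        # marginal p(y)
    kl = np.sum(p_y_given_x * np.log(p_y_given_x / p_y), axis=1)
    return np.exp(kl.mean())

eye = np.eye(4)
# Confident predictions spread over all 4 classes (sharp AND diverse).
sharp_diverse = np.tile(eye, (25, 1)) * 0.96 + 0.01
# Confident predictions, but every image lands in class 0 (mode collapse).
sharp_narrow = np.tile(eye[0], (100, 1)) * 0.96 + 0.01

is_diverse = inception_score(sharp_diverse)
is_narrow = inception_score(sharp_narrow)
print(is_diverse, is_narrow)   # diverse scores well above 1; collapsed ~ 1
```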

Precision and Recall

Precision: what fraction of generated samples are realistic (fall inside the real data manifold)? Recall: what fraction of the real data manifold is covered by generated samples?

$$\text{Precision} = \frac{|\hat{p}_g \cap \mathcal{M}_\text{real}|}{|\hat{p}_g|}, \qquad \text{Recall} = \frac{|\hat{p}_\text{real} \cap \mathcal{M}_g|}{|\hat{p}_\text{real}|}$$

High precision + low recall → mode collapse (realistic but narrow). Low precision + high recall → diverse but blurry/unrealistic.

[Interactive FID visualizer: real and generated Inception features compared in a conceptual 2-D PCA space, ranging from mode collapse to a generator matching $p_\text{data}$; lower is better]

13. GAN Variants at a Glance

| Model | Key Innovation | Use Case |
| --- | --- | --- |
| Vanilla GAN | Minimax adversarial training | Proof of concept |
| DCGAN | Convolutional architecture conventions | Image generation |
| cGAN | Class/label conditioning | Controlled generation |
| Pix2Pix | Image-to-image translation with paired data | Style transfer, maps→satellite |
| CycleGAN | Unpaired image translation via cycle consistency | Domain adaptation |
| WGAN-GP | Wasserstein distance + gradient penalty | Stable training |
| ProGAN | Progressive resolution growing | High-res image synthesis |
| StyleGAN2 | Mapping network + AdaIN + $\mathcal{W}$ space | SOTA face synthesis |
| BigGAN | Large-scale class-conditional training, truncation trick | Diverse ImageNet generation |
| GigaGAN | Scaling GANs to billions of parameters | Text-to-image |

14. GANs vs. Other Generative Models

| Property | GAN | VAE | Normalizing Flow | Autoregressive | Diffusion |
| --- | --- | --- | --- | --- | --- |
| Sample quality (images) | Highest* | Medium | Low–Med | Medium | Highest |
| Training stability | Poor | Good | Good | Good | Good |
| Exact log-likelihood | No | No (ELBO) | Yes | Yes | No (bound) |
| Fast single-step sampling | Yes | Yes | Yes | No | No |
| Structured latent space | Partial | Yes | Yes | No | No |
| Mode coverage | Often poor | Good | Good | Good | Good |

* StyleGAN2-class models remain competitive with diffusion models on face generation; diffusion leads on general-purpose image synthesis.

The GAN Legacy

Even as diffusion models have surpassed GANs on raw sample quality, the adversarial training idea lives on. Discriminators appear in diffusion model pipelines (as perceptual losses), in RLHF reward models, and in audio codec discriminators (EnCodec, DAC). The "train a judge to train a generator" paradigm is one of the most enduring ideas in deep learning.


Summary

| Concept | Key Idea |
| --- | --- |
| Minimax game | $\min_G \max_D \; \mathbb{E}[\log D(\mathbf{x})] + \mathbb{E}[\log(1-D(G(\mathbf{z})))]$ |
| Optimal discriminator | $D^*(\mathbf{x}) = p_\text{data} / (p_\text{data} + p_g)$ — coin flip when $p_g = p_\text{data}$ |
| JS divergence | Objective equivalent to minimising JS — but has zero gradient when supports are disjoint |
| Non-saturating loss | Replace $\log(1-D)$ with $-\log D$ for the generator — stronger early gradients |
| Mode collapse | Generator ignores diversity to exploit D's weaknesses |
| Wasserstein GAN | Earth mover's distance — smooth gradients everywhere, even with disjoint supports |
| WGAN-GP | Gradient penalty enforces 1-Lipschitz constraint on the critic |
| StyleGAN | Mapping network to $\mathcal{W}$ space + AdaIN style modulation per layer |
| FID | Fréchet distance between Inception features — the standard evaluation metric |
