Generative Adversarial Networks
How two neural networks locked in competition learn to generate indistinguishably realistic data — from the minimax game to Wasserstein distance and StyleGAN.
In 2014, Ian Goodfellow proposed an entirely different way to train a generative model — not by maximising a likelihood, but by staging a competition.
A Generative Adversarial Network (GAN) pits two neural networks against each other in a zero-sum game. One network, the Generator, tries to fabricate convincing fake data. The other, the Discriminator, tries to tell fakes from real samples. As they compete, the generator gets better at fooling the discriminator, and the discriminator gets better at detecting fakes — until, at the theoretical optimum, the generator produces data indistinguishable from reality.
The result: state-of-the-art image synthesis, the first photorealistic face generators, and a template for adversarial training that permeates modern deep learning.
1. The Counterfeiter and the Detective
The easiest way to understand GANs is through an analogy.
Imagine a counterfeiter who prints fake banknotes. A detective examines notes and tries to identify fakes. Each time the detective catches a forgery, the counterfeiter studies what gave them away and improves the next batch. Each time the counterfeiter fools the detective, the detective studies the successful fake and sharpens their eye.
This feedback loop drives both parties toward a higher standard:
- The counterfeiter (Generator) learns to produce increasingly realistic fakes.
- The detective (Discriminator) learns increasingly subtle distinctions.
The game ends when the counterfeiter's fakes are so good that the detective can do no better than flipping a coin — they literally cannot tell real from fake.
2. The Two Networks
Generator
The generator is a neural network that maps a random latent vector $\mathbf{z}$, drawn from a simple prior (usually $\mathcal{N}(\mathbf{0}, \mathbf{I})$), to a data sample:

$$\mathbf{x} = G(\mathbf{z})$$
The generator never sees real data directly. It only receives gradient signals from the discriminator telling it how to improve its fakes. Architecturally it is often a stack of transposed convolutions (for images) that upsample from the latent code to full resolution.
Discriminator
The discriminator is a classifier that takes a data sample $\mathbf{x}$ (real or generated) and outputs a scalar probability that the sample is real:

$$D(\mathbf{x}) \in [0, 1]$$
Architecturally it mirrors the generator in reverse — a stack of strided convolutions that compress an image down to a single scalar.
| | Generator | Discriminator |
|---|---|---|
| Input | Random noise $\mathbf{z}$ | Data sample (real or fake) |
| Output | Fake sample $G(\mathbf{z})$ | Scalar probability $D(\mathbf{x})$ |
| Goal | Fool $D$ into outputting 1 | Output 1 for real, 0 for fake |
| Architecture | Upsampling (transposed conv) | Downsampling (strided conv) |
3. The Minimax Objective
The formal objective of GAN training is the following minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(1 - D(G(\mathbf{z})))]$$
Breaking this down:
The discriminator ($D$) wants to:
- Make $D(\mathbf{x}) \to 1$ for real data — maximise $\log D(\mathbf{x})$.
- Make $D(G(\mathbf{z})) \to 0$ for fake data — maximise $\log(1 - D(G(\mathbf{z})))$.
The generator ($G$) wants to:
- Make $D(G(\mathbf{z})) \to 1$ so the discriminator is fooled — minimise $\log(1 - D(G(\mathbf{z})))$.
Binary Cross-Entropy Connection
$V(D, G)$ is exactly the negative binary cross-entropy of a classifier that labels real samples as 1 and fake samples as 0. The discriminator is just a binary classifier trained with BCE loss — the adversarial twist is that the "fake" examples are generated by a competing network rather than being fixed.
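This equivalence is easy to check numerically. A minimal sketch with made-up discriminator outputs (the probabilities below are arbitrary, not from a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discriminator outputs on one minibatch (probabilities).
d_real = rng.uniform(0.6, 0.99, size=8)   # D(x) on real samples
d_fake = rng.uniform(0.01, 0.4, size=8)   # D(G(z)) on fake samples

# The minimax objective V(D, G) evaluated on this minibatch.
V = np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# Standard BCE loss of the same classifier: real labelled 1, fake labelled 0.
bce = -np.mean(np.log(d_real)) - np.mean(np.log(1 - d_fake))

print(V, -bce)  # identical: maximising V over D is minimising BCE
```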
4. The Training Loop
GANs are trained by alternating gradient updates — not jointly minimising a single loss.
One Training Iteration
Step 1 — Update the Discriminator (fix $G$, update $D$):
Sample a minibatch of real data $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(m)}\}$ and a minibatch of fake data $\{G(\mathbf{z}^{(1)}), \dots, G(\mathbf{z}^{(m)})\}$.
Perform gradient ascent on $V(D, G)$ with respect to the discriminator's parameters $\theta_D$ (equivalently, gradient descent on the BCE loss). In practice $D$ is updated $k$ times per $G$ update (often $k = 1$; the Wasserstein GAN uses $k = 5$).
Step 2 — Update the Generator (fix $D$, update $G$):
Sample a new minibatch of latent codes $\{\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(m)}\}$ and perform gradient descent on the generator loss with respect to $\theta_G$, with gradients flowing through the (frozen) discriminator.
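The alternation can be demonstrated end to end on a toy problem. The sketch below is illustrative only: a 1-D dataset, a linear generator, a logistic-regression discriminator, hand-derived gradients, and made-up hyperparameters, using the non-saturating generator loss discussed in the next subsection:

```python
import numpy as np

# Toy 1-D GAN. Real data ~ N(4, 0.5); generator G(z) = a*z + b;
# discriminator D(x) = sigmoid(w*x + c). All gradients derived by hand.
rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 1.0, 0.0   # generator parameters
w, c = 0.0, 0.0   # discriminator parameters
lr, batch = 0.05, 64

for step in range(2000):
    # --- Step 1: update D (G fixed) via BCE gradient descent ---
    x_real = rng.normal(4.0, 0.5, batch)
    z = rng.normal(size=batch)
    x_fake = a * z + b
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    grad_w = np.mean(-(1 - d_real) * x_real) + np.mean(d_fake * x_fake)
    grad_c = np.mean(-(1 - d_real)) + np.mean(d_fake)
    w -= lr * grad_w
    c -= lr * grad_c

    # --- Step 2: update G (D fixed), non-saturating loss -log D(G(z)) ---
    z = rng.normal(size=batch)
    x_fake = a * z + b
    d_fake = sigmoid(w * x_fake + c)
    grad_a = np.mean(-(1 - d_fake) * w * z)
    grad_b = np.mean(-(1 - d_fake) * w)
    a -= lr * grad_a
    b -= lr * grad_b

print(f"generator mean ~ {b:.2f} (real data mean is 4.0)")
```

The generator's offset $b$ drifts toward the real data's mean because the discriminator's gradient always points fakes toward the region it currently labels "real".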
The Non-Saturating Generator Loss
The original minimax loss for $G$ minimises $\log(1 - D(G(\mathbf{z})))$. In practice this saturates early in training — when $D$ easily rejects fakes, $D(G(\mathbf{z})) \approx 0$ and gradients vanish. The standard fix is to instead maximise $\log D(G(\mathbf{z}))$, which provides stronger gradients when the generator is weak.
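The saturation is visible from the gradients with respect to the discriminator's logit $s$, where $D(G(\mathbf{z})) = \sigma(s)$. A short sketch:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# s is the discriminator's logit on a fake sample. Early in training D
# confidently rejects fakes, so s is very negative and D(G(z)) ~ 0.
s = np.array([-8.0, -4.0, 0.0, 4.0])
p = sigmoid(s)  # D(G(z))

# Gradient of each generator loss w.r.t. the logit s:
grad_saturating = -p            # d/ds of  log(1 - sigmoid(s))
grad_nonsat     = -(1.0 - p)    # d/ds of -log(sigmoid(s))

for si, gs, gn in zip(s, grad_saturating, grad_nonsat):
    print(f"s={si:+.0f}  saturating={gs:+.4f}  non-saturating={gn:+.4f}")
```

At $s = -8$ the saturating gradient is essentially zero while the non-saturating gradient is close to $-1$: exactly when the generator most needs a learning signal.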
5. The Optimal Discriminator and Global Minimum
What does the minimax game converge to? Goodfellow et al. proved two key results.
Optimal Discriminator
For a fixed generator $G$ defining distribution $p_g$, the optimal discriminator is:

$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$
When $p_g = p_{\text{data}}$, this gives $D^*(\mathbf{x}) = \tfrac{1}{2}$ everywhere — the discriminator can do no better than chance, meaning the generator has won.
The Global Minimum
Substituting $D^*$ into the objective, the minimax value becomes:

$$C(G) = -\log 4 + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$$

where $\mathrm{JSD}$ is the Jensen-Shannon divergence — a symmetric, bounded version of KL divergence:

$$\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2} D_{\mathrm{KL}}\!\left(p \,\Big\|\, \tfrac{p+q}{2}\right) + \tfrac{1}{2} D_{\mathrm{KL}}\!\left(q \,\Big\|\, \tfrac{p+q}{2}\right)$$
The global minimum $C(G) = -\log 4$ is achieved when $p_g = p_{\text{data}}$, giving $\mathrm{JSD} = 0$ and $D^*(\mathbf{x}) = \tfrac{1}{2}$.
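Both results can be checked numerically. A sketch that discretises two densities on a grid (the Gaussians here are an arbitrary choice):

```python
import numpy as np

# Evaluate D*(x) = p_data(x) / (p_data(x) + p_g(x)) on a grid, and check
# the minimax value when the generator has exactly matched the data.
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
gauss = lambda x, mu, s: np.exp(-(x - mu)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

p_data = gauss(x, 0.0, 1.0)
p_g    = gauss(x, 0.0, 1.0)          # p_g = p_data
d_star = p_data / (p_data + p_g)     # == 1/2 everywhere

# V(D*, G) = ∫ p_data log D* + ∫ p_g log(1 - D*)  ->  -log 4 at the optimum
V = np.sum(p_data * np.log(d_star)) * dx + np.sum(p_g * np.log(1 - d_star)) * dx
print(V, -np.log(4))  # both ~ -1.386
```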
6. Training Instabilities
Despite the elegance of the theory, training GANs in practice is notoriously difficult. The root cause: the JS divergence used by the vanilla GAN is a problematic training signal.
The Vanishing Gradient Problem
When $p_{\text{data}}$ and $p_g$ have disjoint supports (which is almost always true early in training for high-dimensional data), the JS divergence is constant — it equals $\log 2$ regardless of how far apart the distributions are.
This means the generator's loss gradient is zero — training stalls completely.
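A small numerical sketch makes the flatness concrete. Two unit-width uniform densities with disjoint supports have $\mathrm{JSD} = \log 2$ whether they are 2 apart or 20 apart (the box distributions and grid are an arbitrary illustration):

```python
import numpy as np

def jsd(p, q, dx):
    """Jensen-Shannon divergence between two discretised 1-D densities."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Unit-width uniform densities separated by a growing gap.
x = np.linspace(-1.0, 30.0, 31001)
dx = x[1] - x[0]
box = lambda lo: ((x >= lo) & (x < lo + 1.0)).astype(float)

p = box(0.0)
gaps = [2.0, 5.0, 20.0]
vals = [jsd(p, box(g), dx) for g in gaps]
for g, v in zip(gaps, vals):
    print(f"gap={g:4.0f}  JSD={v:.4f}")   # ~ log 2 ~ 0.693 every time
```

No matter how far the generator is from the data, this objective reports the same value, so its gradient with respect to the gap is zero.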
Mode Collapse
Another failure: the generator discovers it can fool the discriminator by producing only a single highly realistic output (or a few), ignoring the diversity of the training data.
For example, when trained on MNIST, the generator might produce only "8"s. The discriminator eventually learns to reject them, the generator switches to "3"s, and the cycle repeats — hopping between modes rather than covering all of them.
| Symptom | Cause |
|---|---|
| Generated samples all look identical | Generator exploiting a single weakness in D |
| Training loss oscillates without converging | G and D chasing each other in circles |
| Sudden quality drops mid-training | G switches mode after D adapts |
7. Wasserstein GAN: A Better Distance
Wasserstein GAN (Arjovsky et al., 2017) fixes the vanishing gradient problem by replacing JS divergence with the Wasserstein-1 distance (also called the Earth Mover's Distance, EMD).
The Earth Mover's Distance
Intuitively: imagine $p_{\text{data}}$ and $p_g$ as piles of earth. The EMD is the minimum total work (mass × distance) needed to reshape one pile into the other:

$$W(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \gamma}\big[\|\mathbf{x} - \mathbf{y}\|\big]$$

where $\Pi$ is the set of joint distributions whose marginals are $p_{\text{data}}$ and $p_g$.
Unlike JS divergence, the Wasserstein distance:
- Is always continuous as a function of the generator's parameters.
- Provides meaningful gradients even when the supports are disjoint.
- Correlates with sample quality — a lower $W$ distance corresponds to visibly better samples.
The Dual Formulation
Computing the infimum directly is intractable. By the Kantorovich–Rubinstein duality, it is equivalent to:

$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim p_g}[f(\mathbf{x})]$$

where the supremum is over all 1-Lipschitz functions $f$. The discriminator (now called the critic) is trained to approximate this optimal $f$.
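In contrast to the flat JS divergence above, the Wasserstein distance grows linearly as distributions move apart. In 1-D the optimal transport plan simply matches sorted samples, which gives a tiny empirical estimator (the Gaussians and sample sizes are arbitrary):

```python
import numpy as np

def w1(samples_p, samples_q):
    """Empirical Wasserstein-1 distance between equal-size 1-D samples.
    In 1-D the optimal coupling matches sorted order (quantile coupling)."""
    return np.mean(np.abs(np.sort(samples_p) - np.sort(samples_q)))

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, 10000)
shifts = [1.0, 5.0, 20.0]
dists = [w1(p, rng.normal(s, 1.0, 10000)) for s in shifts]
for s, d in zip(shifts, dists):
    print(f"shift={s:4.0f}  W1 ~ {d:.2f}")   # grows ~ linearly with the shift
```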
8. Enforcing the Lipschitz Constraint
The critic must be 1-Lipschitz. The original WGAN enforced this by weight clipping — clamping all critic weights to $[-c, c]$ (with $c = 0.01$ in the paper) after each update. This works but is crude.
Gradient Penalty (WGAN-GP)
WGAN-GP (Gulrajani et al., 2017) enforces the Lipschitz constraint more elegantly by adding a penalty on the gradient norm of the critic, evaluated at interpolated points $\hat{\mathbf{x}} = \epsilon \mathbf{x} + (1 - \epsilon) G(\mathbf{z})$ between real and fake samples:

$$L = \mathbb{E}[D(G(\mathbf{z}))] - \mathbb{E}[D(\mathbf{x})] + \lambda \, \mathbb{E}_{\hat{\mathbf{x}}}\big[(\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 - 1)^2\big]$$
The penalty forces $\|\nabla_{\hat{\mathbf{x}}} D(\hat{\mathbf{x}})\|_2 \approx 1$ along the straight-line paths between real and fake samples, which is exactly where the constraint matters most.
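A minimal sketch of the penalty, using a linear critic $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ so the input-gradient is known in closed form (real frameworks compute it with autograd; the weights and data here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty(w, x_real, x_fake, lam=10.0):
    """WGAN-GP penalty for a linear critic f(x) = w.x + b, whose gradient
    w.r.t. its input is simply w at every point."""
    eps = rng.uniform(size=(len(x_real), 1))        # interpolation weights
    x_hat = eps * x_real + (1 - eps) * x_fake       # points on the lines
    # For a linear critic the input-gradient at every x_hat is w, so the
    # penalty does not depend on x_hat; for a deep critic it would.
    grad_norm = np.linalg.norm(w) * np.ones(len(x_hat))
    return lam * np.mean((grad_norm - 1.0) ** 2)

w = np.array([3.0, -4.0])                   # ||w|| = 5: violates 1-Lipschitz
x_real = rng.normal(0, 1, (8, 2))
x_fake = rng.normal(3, 1, (8, 2))
gp = gradient_penalty(w, x_real, x_fake)
print(gp)   # 10 * (5 - 1)^2 = 160.0
```

Minimising this term pushes $\|\mathbf{w}\|$ toward 1, i.e. toward the boundary of the 1-Lipschitz family where the Kantorovich–Rubinstein supremum is attained.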
Without the penalty, the critic can become arbitrarily steep or flat, causing vanishing or exploding gradients.
9. Conditional GANs
The vanilla GAN generates samples from the full data distribution with no control over the output. Conditional GANs (cGAN, Mirza & Osindero, 2014) add a conditioning signal $\mathbf{y}$ to both networks: the generator becomes $G(\mathbf{z}, \mathbf{y})$ and the discriminator $D(\mathbf{x}, \mathbf{y})$.
The conditioning signal can be:
- A class label (generate a cat vs. a dog)
- A text description (text-to-image)
- Another image (image-to-image translation, Pix2Pix)
- A pose or landmark map (human pose synthesis)
The minimax objective simply conditions every expectation on $\mathbf{y}$:

$$\min_G \max_D \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x}, \mathbf{y})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(1 - D(G(\mathbf{z}, \mathbf{y}), \mathbf{y}))]$$
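One common way to inject the condition (among several) is to concatenate a one-hot label onto the generator's latent code and onto the discriminator's input. A sketch with assumed dimensions:

```python
import numpy as np

# Conditioning by concatenation: the label is appended to G's latent code.
n_classes, latent_dim = 10, 64
rng = np.random.default_rng(0)

def one_hot(y, n):
    out = np.zeros((len(y), n))
    out[np.arange(len(y)), y] = 1.0
    return out

z = rng.normal(size=(4, latent_dim))
y = np.array([3, 3, 7, 7])                 # requested classes

g_input = np.concatenate([z, one_hot(y, n_classes)], axis=1)
print(g_input.shape)  # (4, 74): the generator is told which class to draw
```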
10. Architectural Innovations
DCGAN (2015)
Deep Convolutional GAN established the architectural conventions that made GANs practical for images:
- Generator: series of transposed convolutions (fractionally-strided conv) with batch norm and ReLU.
- Discriminator: series of strided convolutions with batch norm and LeakyReLU.
- No fully-connected layers except at the input/output.
- Batch normalisation in both networks (except final layer of G and first layer of D).
Progressive Growing of GANs (2018)
Training high-resolution GANs directly is unstable — the discriminator can easily distinguish coarse structural mistakes. ProGAN (Karras et al., 2018) solves this by growing both networks incrementally: start at 4×4, then add a layer to reach 8×8, then 16×16, up to 1024×1024. Each resolution is trained to convergence before the next layer is added.
StyleGAN (2019) and StyleGAN2 (2020)
StyleGAN (Karras et al., 2019) introduced the most influential GAN architecture to date:
- Mapping network: an 8-layer fully-connected network $f: \mathcal{Z} \to \mathcal{W}$ maps the latent code $\mathbf{z}$ to an intermediate latent space $\mathcal{W}$. $\mathcal{W}$ is less entangled than $\mathcal{Z}$.
- Adaptive Instance Normalisation (AdaIN): instead of feeding $\mathbf{w}$ directly as input, $\mathbf{w}$ is used to modulate the style (scale and bias) of each layer's feature maps:

$$\mathrm{AdaIN}(\mathbf{x}_i, \mathbf{y}) = \mathbf{y}_{s,i} \, \frac{\mathbf{x}_i - \mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)} + \mathbf{y}_{b,i}$$
- Stochastic variation: per-pixel Gaussian noise is added at each layer to model fine-grained stochastic details (individual hair strands, skin pores) separately from global style.
The $\mathcal{W}$ space offers remarkable properties: different layers control different levels of style — coarse structure (pose, face shape) at early layers, fine details (colour, texture) at later layers — enabling style mixing by swapping $\mathbf{w}$ codes at different layer groups.
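AdaIN itself is only a few lines. A minimal NumPy sketch (the tensor layout and the `eps` stabiliser are my assumptions, not StyleGAN's exact implementation):

```python
import numpy as np

def adain(x, y_scale, y_bias, eps=1e-5):
    """Adaptive Instance Norm: normalise each channel of each sample to
    zero mean / unit variance over its spatial extent, then apply the
    style-derived scale and bias.
    x: (batch, channels, H, W); y_scale, y_bias: (batch, channels)."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)
    return y_scale[:, :, None, None] * x_norm + y_bias[:, :, None, None]

rng = np.random.default_rng(0)
feat = rng.normal(5.0, 3.0, (2, 4, 8, 8))    # arbitrary feature maps
scale = np.full((2, 4), 2.0)                 # style scale from w
bias = np.full((2, 4), 1.0)                  # style bias from w

out = adain(feat, scale, bias)
# Every channel now has mean ~ 1.0 and std ~ 2.0: the style dictated them.
print(out.mean(axis=(2, 3))[0, 0], out.std(axis=(2, 3))[0, 0])
```

Whatever statistics the incoming feature maps had, the output's per-channel mean and variance are dictated entirely by the style — which is exactly why swapping $\mathbf{w}$ swaps the style.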
11. The Latent Space
Unlike VAEs, a GAN's latent space has no explicit structure imposed by a loss term. Yet in practice it develops meaningful geometry — a result of the generator learning a smooth mapping to avoid easily-discriminated discontinuities.
Interpolation
Straight-line interpolation between two latent codes $\mathbf{z}_A$ and $\mathbf{z}_B$:

$$\mathbf{z}(\lambda) = (1 - \lambda)\,\mathbf{z}_A + \lambda\,\mathbf{z}_B, \quad \lambda \in [0, 1]$$
In $\mathcal{Z}$ space (raw Gaussian), linear paths dip toward the centre, off the spherical shell where nearly all of the prior's mass lies, and can cause abrupt changes. In StyleGAN's $\mathcal{W}$ space, paths are more perceptually linear — intermediate values produce plausible intermediate faces rather than blends.
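The geometric issue is easy to see numerically: a $d$-dimensional standard Gaussian concentrates on a shell of radius $\approx \sqrt{d}$, and the midpoint of two independent codes has norm $\approx \sqrt{d/2}$, well inside that shell. A sketch with an assumed latent dimensionality:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                   # typical GAN latent dimensionality
z_a, z_b = rng.normal(size=d), rng.normal(size=d)

lerp = lambda lam: (1 - lam) * z_a + lam * z_b
mid = lerp(0.5)

print(f"||z_a||      ~ {np.linalg.norm(z_a):.1f}")   # ~ sqrt(512)   ~ 22.6
print(f"||midpoint|| ~ {np.linalg.norm(mid):.1f}")   # ~ sqrt(512/2) ~ 16.0
```

The midpoint is about $1/\sqrt{2}$ as long as the endpoints, i.e. it lies in a region the generator rarely saw during training.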
Disentanglement
In StyleGAN's $\mathcal{W}$ space, individual directions often correspond to semantic attributes (age, gender, smile, lighting). Finding these directions — via methods like GANSpace or SeFa — enables controlled editing of real images without retraining.
12. Evaluating GANs
GANs have no likelihood — you cannot compare models by log-probability. Evaluation requires dedicated metrics.
Fréchet Inception Distance (FID)
The most widely used metric. Both real and generated samples are passed through an Inception-v3 network and the feature activations (at the last pooling layer) are collected. FID fits a multivariate Gaussian to each set and computes the Fréchet distance between them:

$$\mathrm{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|_2^2 + \mathrm{Tr}\!\left(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2\,(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g)^{1/2}\right)$$
| FID value | Interpretation |
|---|---|
| 0 | Generated distribution identical to real (theoretical ideal) |
| < 10 | Excellent — near-photorealistic |
| 10–50 | Good — clearly recognisable |
| > 100 | Poor — visible artefacts or mode collapse |
Lower FID is better.
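The formula is easy to sketch in the simplified case of diagonal covariances, where the matrix square root reduces to an element-wise square root (real FID uses full covariances of Inception-v3 features; the toy statistics below are made up):

```python
import numpy as np

def fid_diag(mu_r, var_r, mu_g, var_g):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu_r - mu_g||^2 + sum(var_r + var_g - 2*sqrt(var_r * var_g))."""
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(var_r + var_g - 2.0 * np.sqrt(var_r * var_g))
    return mean_term + cov_term

mu_r, var_r = np.array([0.0, 0.0]), np.array([1.0, 1.0])

fid_same = fid_diag(mu_r, var_r, mu_r, var_r)              # identical dists
fid_shift = fid_diag(mu_r, var_r, np.array([3.0, 0.0]), var_r)  # mean shift
print(fid_same, fid_shift)   # 0.0 and 9.0
```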
Inception Score (IS)
Measures two things simultaneously: sharpness (each image is classified confidently into one class) and diversity (the marginal class distribution is broad). Higher IS is better.
IS has fallen out of favour because it correlates poorly with visual quality and can be gamed.
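Concretely, $\mathrm{IS} = \exp\!\big(\mathbb{E}_{\mathbf{x}}[D_{\mathrm{KL}}(p(y \mid \mathbf{x}) \,\|\, p(y))]\big)$. A sketch from a matrix of classifier probabilities (the two toy batches below are constructed by hand to show the extremes):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp(E_x[ KL(p(y|x) || p(y)) ]) from an (n_images, n_classes)
    matrix of classifier probabilities."""
    p_y = probs.mean(axis=0)                             # marginal class dist.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return np.exp(kl.mean())

# Sharp AND diverse: each image confidently one class, classes spread out.
sharp_diverse = np.eye(4)
# Sharp but collapsed: every image is confidently the same class.
collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))

print(inception_score(sharp_diverse))  # ~ 4.0: the maximum for 4 classes
print(inception_score(collapsed))      # ~ 1.0: the minimum
```

A collapsed generator scores the minimum even though every individual sample is "confident" — diversity enters only through the marginal $p(y)$.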
Precision and Recall
Precision: what fraction of generated samples are realistic (fall inside the real data manifold)? Recall: what fraction of the real data manifold is covered by generated samples?
High precision + low recall → mode collapse (realistic but narrow). Low precision + high recall → diverse but blurry/unrealistic.
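A toy 1-D version of the k-NN manifold estimators used in practice makes the mode-collapse signature visible. Everything here (the membership rule, `k`, the data) is a simplifying assumption for illustration:

```python
import numpy as np

def precision_recall(real, fake, k=3):
    """Toy estimator: a point is 'on' the other set's manifold if it lies
    within the k-th nearest-neighbour radius of some member of that set."""
    def radius(pts):
        d = np.abs(pts[:, None] - pts[None, :])
        return np.sort(d, axis=1)[:, k]          # k-NN distance per point
    def covered(queries, support, r):
        d = np.abs(queries[:, None] - support[None, :])
        return np.mean((d <= r[None, :]).any(axis=1))
    precision = covered(fake, real, radius(real))   # fakes on real manifold
    recall = covered(real, fake, radius(fake))      # reals on fake manifold
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 200)
collapsed_fake = rng.normal(0.0, 0.05, 200)   # mode collapse: one tight clump
p, r = precision_recall(real, collapsed_fake)
print(f"precision ~ {p:.2f}  recall ~ {r:.2f}")   # high precision, low recall
```

The collapsed generator's samples are all realistic (high precision) but cover only a sliver of the data distribution (low recall).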
13. GAN Variants at a Glance
| Model | Key Innovation | Use Case |
|---|---|---|
| Vanilla GAN | Minimax adversarial training | Proof of concept |
| DCGAN | Convolutional architecture conventions | Image generation |
| cGAN | Class/label conditioning | Controlled generation |
| Pix2Pix | Image-to-image translation with paired data | Style transfer, maps→satellite |
| CycleGAN | Unpaired image translation via cycle consistency | Domain adaptation |
| WGAN-GP | Wasserstein distance + gradient penalty | Stable training |
| ProGAN | Progressive resolution growing | High-res image synthesis |
| StyleGAN2 | Mapping network + AdaIN + $\mathcal{W}$ space | SOTA face synthesis |
| BigGAN | Large-scale class-conditional, truncation trick | Diverse ImageNet generation |
| GigaGAN | Scale to billions of parameters | Text-to-image |
14. GANs vs. Other Generative Models
| Property | GAN | VAE | Normalizing Flow | Autoregressive | Diffusion |
|---|---|---|---|---|---|
| Sample quality (images) | Highest* | Medium | Low–Med | Medium | Highest |
| Training stability | Poor | Good | Good | Good | Good |
| Exact log-likelihood | No | No (ELBO) | Yes | Yes | No (bound) |
| Fast single-step sampling | Yes | Yes | Yes | No | No |
| Structured latent space | Partial | Yes | Yes | No | No |
| Mode coverage | Often poor | Good | Good | Good | Good |
* StyleGAN2-class models remain competitive with diffusion models on face generation; diffusion leads on general-purpose image synthesis.
The GAN Legacy
Even as diffusion models have surpassed GANs on raw sample quality, the adversarial training idea lives on. Discriminators appear in diffusion model pipelines (as perceptual losses), in RLHF reward models, and in audio codec discriminators (EnCodec, DAC). The "train a judge to train a generator" paradigm is one of the most enduring ideas in deep learning.
Summary
| Concept | Key Idea |
|---|---|
| Minimax game | $\min_G \max_D \, \mathbb{E}[\log D(\mathbf{x})] + \mathbb{E}[\log(1 - D(G(\mathbf{z})))]$ |
| Optimal discriminator | $D^*(\mathbf{x}) = p_{\text{data}}(\mathbf{x}) \,/\, (p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x}))$ — coin flip when $p_g = p_{\text{data}}$ |
| JS divergence | Objective equivalent to minimising JS — but has zero gradient when supports are disjoint |
| Non-saturating loss | Replace $\min \log(1 - D(G(\mathbf{z})))$ with $\max \log D(G(\mathbf{z}))$ for the generator — stronger early gradients |
| Mode collapse | Generator ignores diversity to exploit D's weaknesses |
| Wasserstein GAN | Earth mover's distance — smooth gradients everywhere, even with disjoint supports |
| WGAN-GP | Gradient penalty enforces 1-Lipschitz constraint on the critic |
| StyleGAN | Mapping network to $\mathcal{W}$ space + AdaIN style modulation per layer |
| FID | Fréchet distance between Inception features — the standard evaluation metric |