
Generative Adversarial Networks

How two neural networks locked in competition learn to generate indistinguishably realistic data — from the minimax game to Wasserstein distance and StyleGAN.


In 2014, Ian Goodfellow proposed an entirely different way to train a generative model — not by maximising a likelihood, but by staging a competition.

A Generative Adversarial Network (GAN) pits two neural networks against each other in a zero-sum game. One network, the Generator, tries to fabricate convincing fake data. The other, the Discriminator, tries to tell fakes from real samples. As they compete, the generator gets better at fooling the discriminator, and the discriminator gets better at detecting fakes — until, at the theoretical optimum, the generator produces data indistinguishable from reality.

The result: state-of-the-art image synthesis, the first photorealistic face generators, and a template for adversarial training that permeates modern deep learning.


1. The Counterfeiter and the Detective

The easiest way to understand GANs is through an analogy.

Imagine a counterfeiter who prints fake banknotes. A detective examines notes and tries to identify fakes. Each time the detective catches a forgery, the counterfeiter studies what gave them away and improves the next batch. Each time the counterfeiter fools the detective, the detective studies the successful fake and sharpens their eye.

This feedback loop drives both parties toward a higher standard:

  • The counterfeiter (Generator $G$) learns to produce increasingly realistic fakes.
  • The detective (Discriminator $D$) learns increasingly subtle distinctions.

The game ends when the counterfeiter's fakes are so good that the detective can do no better than flipping a coin — they literally cannot tell real from fake.

[Interactive demo: the generator and discriminator losses, and the generator's success rate, evolving as the two networks compete]

2. The Two Networks

Generator $G$

The generator is a neural network that maps a random latent vector $\mathbf{z}$ drawn from a simple prior (usually $\mathcal{N}(\mathbf{0}, \mathbf{I})$) to a data sample:

$$G_\theta : \mathbb{R}^{d_z} \to \mathbb{R}^{d_x}, \qquad \tilde{\mathbf{x}} = G_\theta(\mathbf{z}), \quad \mathbf{z} \sim p_z(\mathbf{z})$$

The generator never sees real data directly. It only receives gradient signals from the discriminator telling it how to improve its fakes. Architecturally it is often a stack of transposed convolutions (for images) that upsample from the latent code to full resolution.

Discriminator $D$

The discriminator is a classifier that takes a data sample (real or generated) and outputs a scalar probability:

$$D_\phi : \mathbb{R}^{d_x} \to [0, 1], \qquad D_\phi(\mathbf{x}) \approx P(\mathbf{x} \text{ is real})$$

Architecturally it mirrors the generator in reverse — a stack of strided convolutions that compress an image down to a single scalar.

| | Generator $G$ | Discriminator $D$ |
| --- | --- | --- |
| Input | Random noise $\mathbf{z}$ | Data sample $\mathbf{x}$ (real or fake) |
| Output | Fake sample $\tilde{\mathbf{x}}$ | Scalar probability $\in [0, 1]$ |
| Goal | Fool $D$ into outputting 1 | Output 1 for real, 0 for fake |
| Architecture | Upsampling (transposed conv) | Downsampling (strided conv) |
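The division of labour can be sketched in a few lines of NumPy — untrained toy MLPs (the sizes $d_z = 8$, $d_x = 64$ are arbitrary) standing in for the real convolutional stacks:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 8, 64   # latent and data dimensionality (toy sizes)

# Generator: R^{d_z} -> R^{d_x}. A tiny MLP stands in for the
# transposed-conv stack used for images.
W1g = rng.normal(size=(d_z, 32)) * 0.1
W2g = rng.normal(size=(32, d_x)) * 0.1
def G(z):
    h = np.maximum(z @ W1g, 0.0)        # ReLU hidden layer
    return h @ W2g                      # fake sample x_tilde

# Discriminator: R^{d_x} -> [0, 1], mirroring G in reverse.
W1d = rng.normal(size=(d_x, 32)) * 0.1
W2d = rng.normal(size=(32, 1)) * 0.1
def D(x):
    h = np.maximum(x @ W1d, 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ W2d)))   # sigmoid ~ P(x is real)

z = rng.normal(size=(16, d_z))          # z ~ N(0, I)
x_fake = G(z)
p_real = D(x_fake)
print(x_fake.shape, p_real.shape)       # (16, 64) (16, 1)
```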

3. The Minimax Objective

The formal objective of GAN training is the following minimax game:

$$\min_\theta \max_\phi \; \mathcal{V}(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_\text{data}} [\log D_\phi(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z} [\log (1 - D_\phi(G_\theta(\mathbf{z})))]$$

Breaking this down:

The discriminator ($\max_\phi$) wants to:

  • Make $D_\phi(\mathbf{x}) \to 1$ for real data — maximise $\log D_\phi(\mathbf{x})$.
  • Make $D_\phi(G_\theta(\mathbf{z})) \to 0$ for fake data — maximise $\log(1 - D_\phi(G_\theta(\mathbf{z})))$.

The generator ($\min_\theta$) wants to:

  • Make $D_\phi(G_\theta(\mathbf{z})) \to 1$ so the discriminator is fooled — minimise $\log(1 - D_\phi(G_\theta(\mathbf{z})))$.

Binary Cross-Entropy Connection

$\mathcal{V}$ is exactly the negative binary cross-entropy of a classifier that labels real samples as 1 and fake samples as 0. The discriminator is just a binary classifier trained with BCE loss — the adversarial twist is that the "fake" examples are generated by a competing network rather than being fixed.
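This equivalence is easy to check numerically. The sketch below fabricates discriminator outputs and confirms that mean BCE over an equal-sized real/fake batch equals $-\mathcal{V}/2$ (the factor of 2 appears because $\mathcal{V}$ sums two per-batch means while BCE averages over all $2m$ samples):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 1000
# Fabricated discriminator outputs: confident-ish on real, doubtful on fake.
D_real = rng.uniform(0.5, 0.99, m)
D_fake = rng.uniform(0.01, 0.5, m)

# The GAN value function V (which D maximises):
V = np.mean(np.log(D_real)) + np.mean(np.log(1 - D_fake))

# Standard binary cross-entropy of the same classifier,
# with labels 1 for real and 0 for fake.
def bce(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

preds = np.concatenate([D_real, D_fake])
labels = np.concatenate([np.ones(m), np.zeros(m)])

# V sums two per-batch means; mean BCE averages over 2m samples,
# so the two agree up to a factor of -1/2.
assert np.isclose(bce(preds, labels).mean(), -V / 2)
```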

[Interactive 1-D minimax visualization: $p_\text{data}$, $p_g$, and the discriminator output $D(x)$]

4. The Training Loop

GANs are trained by alternating gradient updates — not jointly minimising a single loss.

One Training Iteration

Step 1 — Update the Discriminator (fix $G$, update $D$):

Sample a minibatch of real data $\{\mathbf{x}^{(i)}\}$ and a minibatch of fake data $\{G_\theta(\mathbf{z}^{(i)})\}$.

$$\mathcal{L}_D = -\frac{1}{m} \sum_{i=1}^{m} \left[ \log D_\phi(\mathbf{x}^{(i)}) + \log(1 - D_\phi(G_\theta(\mathbf{z}^{(i)}))) \right]$$

Perform gradient descent on $\mathcal{L}_D$ with respect to $\phi$. In practice $D$ is updated $k$ times per $G$ update (often $k = 1$, or $k = 5$ in Wasserstein GAN).

Step 2 — Update the Generator (fix $D$, update $G$):

Sample a new minibatch of latent codes $\{\mathbf{z}^{(i)}\}$.

$$\mathcal{L}_G = -\frac{1}{m} \sum_{i=1}^{m} \log D_\phi(G_\theta(\mathbf{z}^{(i)}))$$

The Non-Saturating Generator Loss

The original minimax loss for $G$ minimises $\log(1 - D(G(\mathbf{z})))$. In practice this saturates early in training — when $D$ confidently rejects fakes, $\log(1 - D(G(\mathbf{z}))) \approx 0$ and gradients vanish. The standard fix is to instead maximise $\log D(G(\mathbf{z}))$, which provides stronger gradients when the generator is weak.
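The saturation is easy to see by differentiating both losses with respect to the discriminator's logit $\ell$, where $D = \sigma(\ell)$ — a small NumPy check:

```python
import numpy as np

# D's logit on a fake sample; very negative = confidently rejected.
logits = np.array([-8.0, -4.0, 0.0])
D_fake = 1 / (1 + np.exp(-logits))

# Minimax loss log(1 - D): gradient w.r.t. the logit is -D,
# which vanishes exactly when fakes are confidently rejected.
grad_saturating = -D_fake

# Non-saturating loss -log D: gradient w.r.t. the logit is -(1 - D),
# which stays near -1 in that same regime.
grad_nonsat = -(1 - D_fake)

print(grad_saturating)   # magnitudes ~ 0.0003, 0.018, 0.5
print(grad_nonsat)       # magnitudes ~ 0.9997, 0.982, 0.5
```

The confidently rejected fake (logit $-8$) gets almost no learning signal under the minimax loss but nearly full signal under the non-saturating one.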

[Interactive diagram of one training iteration: sample real data, sample latent noise, update the discriminator with $G$'s weights frozen, then update the generator with $D$'s weights frozen]

5. The Optimal Discriminator and Global Minimum

What does the minimax game converge to? Goodfellow et al. proved two key results.

Optimal Discriminator

For a fixed generator defining distribution $p_g$, the optimal discriminator is:

$$D^*(\mathbf{x}) = \frac{p_\text{data}(\mathbf{x})}{p_\text{data}(\mathbf{x}) + p_g(\mathbf{x})}$$

When $p_g = p_\text{data}$, this gives $D^*(\mathbf{x}) = \frac{1}{2}$ everywhere — the discriminator can do no better than chance, meaning the generator has won.

The Global Minimum

Substituting $D^*$ into the objective, the minimax value becomes:

$$\mathcal{V}(G, D^*) = -\log 4 + 2 \cdot D_\text{JS}(p_\text{data} \| p_g)$$

where $D_\text{JS}$ is the Jensen–Shannon divergence — a symmetric, bounded version of KL divergence:

$$D_\text{JS}(p \| q) = \frac{1}{2} D_\text{KL}(p \| m) + \frac{1}{2} D_\text{KL}(q \| m), \quad m = \frac{p + q}{2}$$

The global minimum is achieved when $p_g = p_\text{data}$, giving $D_\text{JS} = 0$ and $\mathcal{V} = -\log 4$.
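Both results can be checked on discrete distributions, where every expectation becomes a finite sum:

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(10); p /= p.sum()   # "p_data" over 10 discrete states
q = rng.random(10); q /= q.sum()   # "p_g"

D_star = p / (p + q)               # optimal discriminator
V = np.sum(p * np.log(D_star)) + np.sum(q * np.log(1 - D_star))

m_mix = (p + q) / 2
kl = lambda a, b: np.sum(a * np.log(a / b))
js = 0.5 * kl(p, m_mix) + 0.5 * kl(q, m_mix)

# V at the optimal discriminator equals -log 4 + 2 * JS(p_data || p_g).
assert np.isclose(V, -np.log(4) + 2 * js)

# And when p_g = p_data, D* is 1/2 everywhere (so V = -log 4).
assert np.allclose(p / (p + p), 0.5)
```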

[Interactive visualization: the optimal discriminator $D^*(x)$ computed from $p_\text{data}$ and $p_g$]


6. Training Instabilities

Despite the elegance of the theory, training GANs in practice is notoriously difficult. The root cause: the JS divergence used by the vanilla GAN is a problematic training signal.

The Vanishing Gradient Problem

When $p_\text{data}$ and $p_g$ have disjoint supports (which is almost always true early in training for high-dimensional data), the JS divergence is constant — it equals $\log 2$ regardless of how far apart the distributions are.

$$D_\text{JS}(p_\text{data} \| p_g) = \log 2 \quad \text{when supports don't overlap}$$

This means the generator's loss gradient is zero — training stalls completely.
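A discrete toy example makes the constancy visible — two histograms with disjoint support give $D_\text{JS} = \log 2$ no matter how far apart they sit:

```python
import numpy as np

def js_divergence(p, q):
    m = (p + q) / 2
    def kl(a, b):
        mask = a > 0                     # convention: 0 * log 0 = 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

bins = 100
p = np.zeros(bins); p[0:5] = 0.2         # p_data lives on bins 0-4

js_at_shift = {}
for shift in (10, 40, 90):
    q = np.zeros(bins); q[shift:shift + 5] = 0.2   # p_g, disjoint support
    js_at_shift[shift] = js_divergence(p, q)
print(js_at_shift)    # log 2 ~ 0.693 at every shift
```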

Mode Collapse

Another failure: the generator discovers it can fool the discriminator by producing only a single highly realistic output (or a few), ignoring the diversity of the training data.

For example, when trained on MNIST, the generator might produce only "8"s. The discriminator eventually learns to reject them, the generator switches to "3"s, and the cycle repeats — hopping between modes rather than covering all of them.

| Symptom | Cause |
| --- | --- |
| Generated samples all look identical | Generator exploiting a single weakness in D |
| Training loss oscillates without converging | G and D chasing each other in circles |
| Sudden quality drops mid-training | G switches mode after D adapts |

[Interactive demo: mode collapse on an 8-Gaussian mixture target — a healthy generator spreads its samples across all modes]


7. Wasserstein GAN: A Better Distance

Wasserstein GAN (Arjovsky et al., 2017) fixes the vanishing gradient problem by replacing JS divergence with the Wasserstein-1 distance (also called the Earth Mover's Distance, EMD).

The Earth Mover's Distance

Intuitively: imagine pdatap_\text{data} and pgp_g as piles of earth. The EMD is the minimum total work (mass × distance) needed to reshape one pile into the other.

$$W(p_\text{data}, p_g) = \inf_{\gamma \in \Pi(p_\text{data}, p_g)} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \gamma} [\|\mathbf{x} - \mathbf{y}\|]$$

Unlike JS divergence, the Wasserstein distance:

  • Is always continuous as a function of the generator's parameters.
  • Provides meaningful gradients even when the supports are disjoint.
  • Correlates with sample quality — a lower $W$ distance generally means visually better samples.
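The contrast with JS divergence is easy to see in one dimension, where $W_1$ has a closed form as the integral of the absolute CDF difference; for two disjoint histograms a fixed distance apart, $W_1$ equals that distance:

```python
import numpy as np

def wasserstein1(p, q):
    # 1-D closed form: integrate |CDF_p - CDF_q| over the support
    # (bin width 1 here).
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q)))

bins = 100
p = np.zeros(bins); p[0:5] = 0.2         # p_data concentrated on bins 0-4
w_at_shift = {}
for shift in (10, 40, 90):
    q = np.zeros(bins); q[shift:shift + 5] = 0.2   # p_g, shifted copy
    w_at_shift[shift] = wasserstein1(p, q)
print(w_at_shift)    # 10.0, 40.0, 90.0 — grows linearly with the gap
```

Unlike the flat JS signal, $W_1$ keeps telling the generator how far it still has to move its mass.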

The Dual Formulation

Computing the infimum directly is intractable. By the Kantorovich-Rubinstein duality, it is equivalent to:

$$W(p_\text{data}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{\mathbf{x} \sim p_\text{data}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{z} \sim p_z}[f(G(\mathbf{z}))]$$

where the supremum is over all 1-Lipschitz functions $f$. The discriminator (now called the critic) is trained to approximate this optimal $f$.

[Interactive comparison: as the distributions move apart, the vanilla JS loss flattens out and training stalls, while the Wasserstein distance keeps a linear slope that gives a clear direction even at large distances]

8. Enforcing the Lipschitz Constraint

The critic must be 1-Lipschitz. The original WGAN enforced this by weight clipping — clamping all critic weights to $[-c, c]$ after each update. This works but is crude.

Gradient Penalty (WGAN-GP)

WGAN-GP (Gulrajani et al., 2017) enforces the Lipschitz constraint more elegantly by adding a penalty on the gradient norm of the critic, evaluated at interpolated points between real and fake samples:

$$\hat{\mathbf{x}} = \epsilon \mathbf{x} + (1-\epsilon) G(\mathbf{z}), \quad \epsilon \sim \text{Uniform}(0, 1)$$

$$\mathcal{L}_\text{critic} = \underbrace{\mathbb{E}_{\tilde{\mathbf{x}}}[f(\tilde{\mathbf{x}})] - \mathbb{E}_{\mathbf{x}}[f(\mathbf{x})]}_{\text{Wasserstein estimate}} + \lambda \underbrace{\mathbb{E}_{\hat{\mathbf{x}}}\left[(\|\nabla_{\hat{\mathbf{x}}} f(\hat{\mathbf{x}})\|_2 - 1)^2\right]}_{\text{Gradient penalty}}$$

The penalty forces $\|\nabla f\| \approx 1$ along the straight-line paths between real and fake samples, which is exactly where the constraint matters most.
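The interpolation step is simple to sketch. The toy below uses a linear critic $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$, whose gradient is $\mathbf{w}$ everywhere, so the penalty can be evaluated without autodiff (real implementations differentiate through the critic network instead):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 64, 16
x_real = rng.normal(loc=+1.0, size=(m, d))
x_fake = rng.normal(loc=-1.0, size=(m, d))

# One epsilon per sample: each x_hat lies on the segment between a pair.
eps = rng.uniform(0.0, 1.0, size=(m, 1))
x_hat = eps * x_real + (1 - eps) * x_fake

# Toy linear critic f(x) = w . x, whose gradient is w at every x_hat,
# so the penalty has a closed form.
w = rng.normal(size=d)
grad_norms = np.full(m, np.linalg.norm(w))   # ||grad f(x_hat)|| per sample
penalty = np.mean((grad_norms - 1.0) ** 2)
print(penalty)                               # > 0 for a generic w

# Rescaling the critic to unit gradient norm drives the penalty to zero.
w_unit = w / np.linalg.norm(w)
assert np.isclose((np.linalg.norm(w_unit) - 1.0) ** 2, 0.0)
```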

[Interactive visualization: the critic's gradient $\nabla f(\hat{x})$ along interpolation paths between real and fake samples]
Without the penalty, the critic can become arbitrarily steep or flat, causing vanishing or exploding gradients.


9. Conditional GANs

The vanilla GAN generates samples from the full data distribution with no control over the output. Conditional GANs (cGAN, Mirza & Osindero, 2014) add a conditioning signal $\mathbf{c}$ to both networks:

$$G_\theta(\mathbf{z}, \mathbf{c}): \quad \tilde{\mathbf{x}} = G(\mathbf{z}, \mathbf{c}) \qquad\qquad D_\phi(\mathbf{x}, \mathbf{c}): \quad \text{real/fake given } \mathbf{c}$$

The conditioning signal $\mathbf{c}$ can be:

  • A class label (generate a cat vs. a dog)
  • A text description (text-to-image)
  • Another image (image-to-image translation, Pix2Pix)
  • A pose or landmark map (human pose synthesis)

The minimax objective simply conditions every expectation on $\mathbf{c}$:

$$\min_G \max_D \; \mathbb{E}_{\mathbf{x},\mathbf{c}}[\log D(\mathbf{x}, \mathbf{c})] + \mathbb{E}_{\mathbf{z},\mathbf{c}}[\log(1 - D(G(\mathbf{z}, \mathbf{c}), \mathbf{c}))]$$
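The simplest way to inject the conditioning — used in the original cGAN — is to concatenate $\mathbf{c}$ to the inputs of both networks. A minimal sketch of the generator side:

```python
import numpy as np

rng = np.random.default_rng(4)
d_z, n_classes = 8, 3
z = rng.normal(size=(5, d_z))          # latent noise for 5 samples

labels = np.array([0, 2, 1, 0, 2])
c = np.eye(n_classes)[labels]          # one-hot conditioning signal

# Simplest conditioning: concatenate c to the generator's input
# (and, symmetrically, pair c with x at the discriminator's input).
g_input = np.concatenate([z, c], axis=1)
print(g_input.shape)                   # (5, 11)
```

Richer conditioning schemes (embedding layers, projection discriminators, AdaIN-style modulation) follow the same idea: the label must reach both players so the discriminator can punish label-inconsistent fakes.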

[Interactive demo: conditional generation — latent noise $\mathbf{z}$ combined with a one-hot class label $\mathbf{c}$ yields $G(\mathbf{z}, \mathbf{c})$]

10. Architectural Innovations

DCGAN (2015)

Deep Convolutional GAN established the architectural conventions that made GANs practical for images:

  • Generator: series of transposed convolutions (fractionally-strided conv) with batch norm and ReLU.
  • Discriminator: series of strided convolutions with batch norm and LeakyReLU.
  • No fully-connected layers except at the input/output.
  • Batch normalisation in both networks (except final layer of G and first layer of D).

Progressive Growing of GANs (2018)

Training high-resolution GANs directly is unstable — the discriminator can easily distinguish coarse structural mistakes. ProGAN (Karras et al., 2018) solves this by growing both networks incrementally: start at 4×4, then add a layer to reach 8×8, then 16×16, up to 1024×1024. Each new layer is faded in gradually and trained until stable before the next is added.

StyleGAN (2019) and StyleGAN2 (2020)

StyleGAN (Karras et al., 2019) introduced the most influential GAN architecture to date:

  1. Mapping network $f$: maps the latent code $\mathbf{z} \in \mathcal{Z}$ to an intermediate latent space $\mathbf{w} \in \mathcal{W}$ via 8 fully-connected layers. $\mathcal{W}$ is less entangled than $\mathcal{Z}$.

  2. Adaptive Instance Normalisation (AdaIN): instead of feeding $\mathbf{w}$ directly as input, $\mathbf{w}$ is used to modulate the style (scale and bias) of each layer's feature maps:

$$\text{AdaIN}(\mathbf{h}_i, \mathbf{w}) = \mathbf{y}_{s,i} \frac{\mathbf{h}_i - \mu(\mathbf{h}_i)}{\sigma(\mathbf{h}_i)} + \mathbf{y}_{b,i}$$

  3. Stochastic variation: per-pixel Gaussian noise is added at each layer to model fine-grained stochastic details (individual hair strands, skin pores) separately from global style.

The $\mathcal{W}$ space offers remarkable properties: different levels of $\mathbf{w}$ control different levels of style — coarse structure (pose, face shape) at early layers, fine details (colour, texture) at later layers — enabling style mixing by swapping $\mathbf{w}$ codes at different layer groups.
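AdaIN itself is only a few lines. A NumPy sketch with channel-first $(C, H, W)$ feature maps — in StyleGAN the style parameters $\mathbf{y}_s$, $\mathbf{y}_b$ come from a learned affine map of $\mathbf{w}$; here they are just fixed values:

```python
import numpy as np

def adain(h, y_s, y_b, eps=1e-5):
    # h: (C, H, W) feature maps; y_s, y_b: (C,) style scale and bias.
    # Normalise each channel over its spatial extent, then re-scale
    # and re-shift with the style parameters.
    mu = h.mean(axis=(1, 2), keepdims=True)
    sigma = h.std(axis=(1, 2), keepdims=True)
    return y_s[:, None, None] * (h - mu) / (sigma + eps) + y_b[:, None, None]

rng = np.random.default_rng(5)
h = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8))
out = adain(h, y_s=np.full(4, 1.5), y_b=np.full(4, 0.5))

# Each output channel now carries the style's statistics:
# mean ~ 0.5 and std ~ 1.5, regardless of the input's statistics.
print(out.mean(axis=(1, 2)), out.std(axis=(1, 2)))
```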


11. The Latent Space

Unlike VAEs, a GAN's latent space has no explicit structure imposed by a loss term. Yet in practice it develops meaningful geometry — a result of the generator learning a smooth mapping to avoid easily-discriminated discontinuities.

Interpolation

Straight-line interpolation between two latent codes $\mathbf{z}_A$ and $\mathbf{z}_B$:

$$\mathbf{z}(\lambda) = (1-\lambda)\mathbf{z}_A + \lambda\mathbf{z}_B, \quad \lambda \in [0,1]$$

In $\mathcal{Z}$ space (raw Gaussian), linear paths pass through the high-density centre and can cause abrupt changes. In StyleGAN's $\mathcal{W}$ space, paths are more perceptually linear — intermediate $\lambda$ values produce plausible intermediate faces rather than blends.
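A sketch of the interpolation, including the norm shrinkage that makes straight lines in $\mathcal{Z}$ problematic in high dimensions (the dimensionality 512 matches StyleGAN's latent size, but any large $d$ shows the effect):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 512
w_a, w_b = rng.normal(size=d), rng.normal(size=d)

# Straight-line path between two latent codes.
lambdas = np.linspace(0.0, 1.0, 7)
path = np.array([(1 - lam) * w_a + lam * w_b for lam in lambdas])

# In Z space, the midpoint of two typical Gaussian samples has a
# noticeably smaller norm than its endpoints: the straight line cuts
# through the centre of the prior rather than staying on its
# high-probability shell (norm ~ sqrt(d)).
print(np.linalg.norm(path[3]), np.linalg.norm(w_a), np.linalg.norm(w_b))
```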

Disentanglement

In StyleGAN's $\mathcal{W}$ space, individual directions often correspond to semantic attributes (age, gender, smile, lighting). Finding these directions — via methods like GANSpace or SeFa — enables controlled editing of real images without retraining.

[Interactive demo: traversal $\mathbf{w}(\lambda) = (1-\lambda)\mathbf{w}_A + \lambda\mathbf{w}_B$ between two points in StyleGAN's $\mathcal{W}$ space]


12. Evaluating GANs

GANs have no likelihood — you cannot compare models by log-probability. Evaluation requires dedicated metrics.

Fréchet Inception Distance (FID)

The most widely used metric. Both real and generated samples are passed through an Inception-v3 network and the feature activations (at the last pooling layer) are collected. FID fits a multivariate Gaussian to each set and computes the Fréchet distance between them:

$$\text{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|^2 + \text{tr}\!\left(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g)^{1/2}\right)$$

| FID value | Interpretation |
| --- | --- |
| 0 | Generated distribution identical to real (theoretical ideal) |
| < 10 | Excellent — near-photorealistic |
| 10–50 | Good — clearly recognisable |
| > 100 | Poor — visible artefacts or mode collapse |

Lower FID is better.
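A simplified FID can be computed in plain NumPy by assuming diagonal covariances, which turns the matrix square root into an element-wise one (real FID uses the full covariance of Inception features; the Gaussian "features" here are synthetic stand-ins):

```python
import numpy as np

def fid_diagonal(feats_r, feats_g):
    # FID with diagonal covariances: the matrix square root of
    # (Sigma_r Sigma_g) reduces to an element-wise sqrt of variances.
    mu_r, mu_g = feats_r.mean(0), feats_g.mean(0)
    var_r, var_g = feats_r.var(0), feats_g.var(0)
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(var_r + var_g - 2 * np.sqrt(var_r * var_g))
    return mean_term + cov_term

rng = np.random.default_rng(7)
feats_real = rng.normal(0.0, 1.0, size=(5000, 64))    # stand-in "features"
feats_close = rng.normal(0.1, 1.0, size=(5000, 64))   # nearly matching generator
feats_far = rng.normal(2.0, 1.5, size=(5000, 64))     # badly mismatched generator

fid_close = fid_diagonal(feats_real, feats_close)
fid_far = fid_diagonal(feats_real, feats_far)
print(fid_close, fid_far)   # the mismatched generator scores far higher
```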

Inception Score (IS)

$$\text{IS} = \exp\!\left(\mathbb{E}_{\mathbf{x} \sim p_g}\left[D_\text{KL}(p(y|\mathbf{x}) \| p(y))\right]\right)$$

Measures two things simultaneously: sharpness (each image is classified confidently into one class) and diversity (the marginal class distribution $p(y)$ is broad). Higher IS is better.

IS has fallen out of favour because it correlates poorly with visual quality and can be gamed.
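Given a matrix of class posteriors $p(y|\mathbf{x})$, the score is a one-liner; the sketch below shows how it rewards sharp, diverse outputs over sharp, collapsed ones (the fabricated posteriors use 4 classes):

```python
import numpy as np

def inception_score(p_y_given_x):
    # p_y_given_x: (N, K) class posteriors p(y|x) for N generated images.
    p_y = p_y_given_x.mean(axis=0)                        # marginal p(y)
    kl = np.sum(p_y_given_x * np.log(p_y_given_x / p_y), axis=1)
    return np.exp(kl.mean())

eye = np.eye(4)
# Confident predictions spread over all 4 classes (sharp AND diverse).
sharp_diverse = np.tile(eye, (25, 1)) * 0.96 + 0.01
# Confident predictions, but every image lands in class 0 (mode collapse).
sharp_narrow = np.tile(eye[0], (100, 1)) * 0.96 + 0.01

is_diverse = inception_score(sharp_diverse)
is_narrow = inception_score(sharp_narrow)
print(is_diverse, is_narrow)   # diverse scores well above 1; collapsed ~ 1
```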

Precision and Recall

Precision: what fraction of generated samples are realistic (fall inside the real data manifold)? Recall: what fraction of the real data manifold is covered by generated samples?

$$\text{Precision} = \frac{|\hat{p}_g \cap \mathcal{M}_\text{real}|}{|\hat{p}_g|}, \qquad \text{Recall} = \frac{|\hat{p}_\text{real} \cap \mathcal{M}_g|}{|\hat{p}_\text{real}|}$$

High precision + low recall → mode collapse (realistic but narrow). Low precision + high recall → diverse but blurry/unrealistic.

[Interactive FID visualizer: real and generated Inception features compared in a conceptual 2-D PCA space, ranging from mode collapse to a generator matching $p_\text{data}$; lower is better]

13. GAN Variants at a Glance

| Model | Key Innovation | Use Case |
| --- | --- | --- |
| Vanilla GAN | Minimax adversarial training | Proof of concept |
| DCGAN | Convolutional architecture conventions | Image generation |
| cGAN | Class/label conditioning | Controlled generation |
| Pix2Pix | Image-to-image translation with paired data | Style transfer, maps→satellite |
| CycleGAN | Unpaired image translation via cycle consistency | Domain adaptation |
| WGAN-GP | Wasserstein distance + gradient penalty | Stable training |
| ProGAN | Progressive resolution growing | High-res image synthesis |
| StyleGAN2 | Mapping network + AdaIN + $\mathcal{W}$ space | SOTA face synthesis |
| BigGAN | Large-scale class-conditional training, truncation trick | Diverse ImageNet generation |
| GigaGAN | Scaling GANs to billions of parameters | Text-to-image |

14. GANs vs. Other Generative Models

| Property | GAN | VAE | Normalizing Flow | Autoregressive | Diffusion |
| --- | --- | --- | --- | --- | --- |
| Sample quality (images) | Highest* | Medium | Low–Med | Medium | Highest |
| Training stability | Poor | Good | Good | Good | Good |
| Exact log-likelihood | No | No (ELBO) | Yes | Yes | No (bound) |
| Fast single-step sampling | Yes | Yes | Yes | No | No |
| Structured latent space | Partial | Yes | Yes | No | No |
| Mode coverage | Often poor | Good | Good | Good | Good |

* StyleGAN2-class models remain competitive with diffusion models on face generation; diffusion leads on general-purpose image synthesis.

The GAN Legacy

Even as diffusion models have surpassed GANs on raw sample quality, the adversarial training idea lives on. Discriminators appear in diffusion model pipelines (as perceptual losses), in RLHF reward models, and in audio codec discriminators (EnCodec, DAC). The "train a judge to train a generator" paradigm is one of the most enduring ideas in deep learning.


Summary

| Concept | Key Idea |
| --- | --- |
| Minimax game | $\min_G \max_D \; \mathbb{E}[\log D(\mathbf{x})] + \mathbb{E}[\log(1-D(G(\mathbf{z})))]$ |
| Optimal discriminator | $D^*(\mathbf{x}) = p_\text{data} / (p_\text{data} + p_g)$ — coin flip when $p_g = p_\text{data}$ |
| JS divergence | Objective equivalent to minimising JS — but has zero gradient when supports are disjoint |
| Non-saturating loss | Replace $\log(1-D)$ with $-\log D$ for the generator — stronger early gradients |
| Mode collapse | Generator ignores diversity to exploit D's weaknesses |
| Wasserstein GAN | Earth mover's distance — smooth gradients everywhere, even with disjoint supports |
| WGAN-GP | Gradient penalty enforces 1-Lipschitz constraint on the critic |
| StyleGAN | Mapping network to $\mathcal{W}$ space + AdaIN style modulation per layer |
| FID | Fréchet distance between Inception features — the standard evaluation metric |
