Autoregressive Models

How modern language models generate sequences one token at a time — from the chain rule of probability to sampling strategies and causal attention.

A language model answers one fundamental question: given everything that came before, what comes next?

Autoregressive models make this concrete by breaking the joint probability of an entire sequence into a chain of conditional probabilities — each token predicted from all previous tokens. This simple factorization, combined with the Transformer architecture, powers GPT, LLaMA, Mistral, and virtually every major modern language model.


1. The Chain Rule Factorization

The probability of a sequence \mathbf{x} = (x_1, x_2, \dots, x_T) can always be decomposed exactly using the chain rule of probability:

P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_1, x_2, \dots, x_{t-1})

This is not an approximation — it is an identity. What makes it useful is that each factor P(x_t \mid x_{<t}) can be modelled by a neural network that takes the prefix x_{<t} as input and outputs a probability distribution over the vocabulary.

Term | Meaning
\mathbf{x} | The full sequence (x_1, \dots, x_T)
x_t | The token at position t
x_{<t} | All tokens before position t (the context)
P(x_t \mid x_{<t}) | The model's next-token distribution

Why This Works

The chain rule requires no independence assumptions. Every token can depend on every preceding token. The model's capacity to capture long-range dependencies comes from the architecture — not from a simplifying assumption.
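
As a concrete illustration, here is a minimal TypeScript sketch that scores a sequence by summing the log of each conditional factor. The toy `nextTokenProbs` function and its three-token vocabulary are invented for this example; in a real model those distributions come from a neural network.

```typescript
// Chain-rule scoring of a sequence over a toy three-token vocabulary.
// `nextTokenProbs` is an invented stand-in for a trained model's
// next-token distribution P(x_t | x_{<t}).
type Token = "the" | "cat" | "sat";

function nextTokenProbs(prefix: Token[]): Record<Token, number> {
  const last = prefix[prefix.length - 1];
  if (last === "the") return { the: 0.1, cat: 0.7, sat: 0.2 };
  if (last === "cat") return { the: 0.1, cat: 0.1, sat: 0.8 };
  return { the: 0.6, cat: 0.2, sat: 0.2 };  // empty prefix, or after "sat"
}

// log P(x) = Σ_t log P(x_t | x_{<t}): the chain rule, applied term by term.
function sequenceLogProb(tokens: Token[]): number {
  let logProb = 0;
  for (let t = 0; t < tokens.length; t++) {
    const dist = nextTokenProbs(tokens.slice(0, t));  // condition on the prefix x_{<t}
    logProb += Math.log(dist[tokens[t]]);             // add log P(x_t | x_{<t})
  }
  return logProb;
}

console.log(sequenceLogProb(["the", "cat", "sat"]));
// = log(0.6) + log(0.7) + log(0.8) ≈ -1.09
```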


2. Vocabulary and Token Distributions

Before diving into how the model is trained, it helps to understand what the model actually outputs.

Modern language models operate over a fixed vocabulary \mathcal{V} — typically 32,000 to 128,000 subword tokens (produced by BPE or SentencePiece tokenisation). At each step, the model outputs a vector of logits \mathbf{z} \in \mathbb{R}^{|\mathcal{V}|}, one per vocabulary entry. A softmax converts these into a valid probability distribution:

P(x_t = v \mid x_{<t}) = \frac{\exp(z_v)}{\sum_{v' \in \mathcal{V}} \exp(z_{v'})}

Logits vs Probabilities (interactive demo): for the prompt "The quick brown", the softmax turns the model's logits into a next-token distribution: fox 69.6%, dog 19.0%, bear 6.3%, rabbit 3.5%, cat 1.7%.
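
The softmax step itself is only a few lines of code. The sketch below uses illustrative logit values, chosen so that the resulting distribution roughly matches the demo above; they are not taken from any real model.

```typescript
// Softmax: logits in, probabilities out. The logit values below are
// illustrative, chosen so the output roughly matches the demo above.
function softmax(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);                   // subtract max for numerical stability
  const exps = logits.map(z => Math.exp(z - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);                          // non-negative and sums to 1
}

const vocab = ["fox", "dog", "bear", "rabbit", "cat"];
const logits = [5.2, 3.9, 2.8, 2.2, 1.5];                 // one raw score per vocabulary entry
const probs = softmax(logits);
vocab.forEach((v, i) => console.log(v, (100 * probs[i]).toFixed(1) + "%"));
// fox 69.6%, dog 19.0%, bear 6.3%, rabbit 3.5%, cat 1.7%
```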

3. Training: Next-Token Prediction

The training objective for an autoregressive model is cross-entropy loss summed over every position in the sequence:

\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})

Minimising this loss is equivalent to maximising the log-likelihood of the training corpus. Crucially, a single sequence provides T training signal pairs — one per position — making the objective extremely data-efficient.

What the Model Sees During Training

At position t, the model receives the prefix (x_1, \dots, x_{t-1}) and must assign high probability to x_t (the ground-truth next token). The gradient of the loss with respect to \theta nudges the model to be more confident about correct continuations.

This is known as teacher forcing: during training, the ground-truth prefix is always fed in, regardless of what the model would have predicted.
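
A minimal sketch of this objective, assuming a hypothetical `modelProbs` function that returns the model's next-token distribution for a given ground-truth prefix:

```typescript
type Dist = Record<string, number>;

// Per-sequence training loss under teacher forcing: at every position the
// ground-truth prefix is fed in and the loss is -log P(true next token).
function crossEntropyLoss(
  tokens: string[],
  modelProbs: (prefix: string[]) => Dist,  // hypothetical stand-in for the model
): number {
  let loss = 0;
  for (let t = 0; t < tokens.length; t++) {
    const prefix = tokens.slice(0, t);          // teacher forcing: always the true prefix
    const p = modelProbs(prefix)[tokens[t]];    // probability assigned to the true next token
    loss += -Math.log(p ?? 1e-12);              // -log P(x_t | x_{<t}); guard against zero
  }
  return loss / tokens.length;                  // average over the T positions
}

// Toy usage: a "model" that is uniform over a four-token vocabulary.
const uniform = (_prefix: string[]): Dist => ({ the: 0.25, cat: 0.25, sat: 0.25, mat: 0.25 });
console.log(crossEntropyLoss(["the", "cat", "sat"], uniform)); // -log(0.25) ≈ 1.386
```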

Teacher Forcing Training (interactive demo): the model reads the ground-truth prefix x_{<t} (e.g. "The") and is forced to predict the ground-truth next token y_t (e.g. "cat"); if it predicts something else, the cross-entropy loss generates a gradient to correct it.

4. Causal Masking: Preventing Future Leakage

Transformers process all positions in parallel during training, which creates a problem: a naive attention mechanism would let token t attend to token t+1, allowing the model to "cheat" by reading the answer before predicting it.

Causal masking (also called a look-ahead mask or autoregressive mask) solves this by setting the attention scores for all positions j > i to -\infty before the softmax, so those positions receive zero attention weight:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right) V

where the mask M is:

M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}

The result is a lower-triangular attention matrix: each token can attend only to itself and all tokens to its left.
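
The mask is easy to construct directly. The sketch below builds the additive mask M and shows how one row of masked scores is normalised; the function names are illustrative rather than taken from any library.

```typescript
// Sketch: building the additive causal mask M used inside the attention softmax.
// M[i][j] = 0 when j <= i (position i may attend to j), -Infinity when j > i.
function causalMask(seqLen: number): number[][] {
  return Array.from({ length: seqLen }, (_, i) =>
    Array.from({ length: seqLen }, (_, j) => (j <= i ? 0 : -Infinity)),
  );
}

// Applying one row of the mask to raw attention scores before the softmax:
// masked future positions get weight exactly 0.
function maskedSoftmaxRow(scores: number[], maskRow: number[]): number[] {
  const masked = scores.map((s, j) => s + maskRow[j]);      // future scores become -Infinity
  const max = Math.max(...masked.filter(Number.isFinite));  // stabilise over visible entries
  const exps = masked.map(s => Math.exp(s - max));          // exp(-Infinity) === 0
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

console.log(causalMask(3));
// [[0, -Inf, -Inf], [0, 0, -Inf], [0, 0, 0]]: each token sees itself and the tokens to its left.
console.log(maskedSoftmaxRow([2.0, 1.0, 3.0], causalMask(3)[1]));
// Only the first two scores compete; the third gets probability 0.
```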

Attention Matrix (Causal Mask) (interactive demo): a 6×6 lower-triangular attention matrix over the tokens "The cat sat on the mat"; every entry above the diagonal is masked to -∞, so, for example, the first token "The" can attend only to itself.

5. Generation: The Autoregressive Loop

At inference time, the model generates a sequence one token at a time. This is the autoregressive loop:

  1. Start with a prompt (the context tokens x_1, \dots, x_k).
  2. Forward pass — feed the current sequence into the model to get logits \mathbf{z} at the last position.
  3. Sample — draw the next token \hat{x}_{k+1} from the distribution P(x \mid x_{\leq k}).
  4. Append \hat{x}_{k+1} to the sequence.
  5. Repeat from step 2 until an end-of-sequence token is sampled or a length limit is reached.

Because each new token depends on all previous ones, generation is inherently sequential — unlike training, it cannot be parallelised across time steps.
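
A minimal sketch of this loop, assuming a hypothetical `model` function that maps the current token ids to logits and a pluggable `sampleFrom` strategy (greedy, temperature, top-k, or top-p, as described in §6):

```typescript
// Sketch of the autoregressive loop: forward pass, sample, append, repeat.
// `model` and `sampleFrom` are hypothetical stand-ins, not a real API.
function generate(
  promptIds: number[],
  model: (ids: number[]) => number[],        // returns one logit per vocabulary entry
  sampleFrom: (logits: number[]) => number,  // greedy argmax, temperature sampling, etc.
  eosId: number,
  maxNewTokens: number,
): number[] {
  const ids = [...promptIds];                // 1. start from the prompt
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = model(ids);               // 2. forward pass over the current sequence
    const nextId = sampleFrom(logits);       // 3. sample the next token
    ids.push(nextId);                        // 4. append it to the sequence
    if (nextId === eosId) break;             // 5. stop at end-of-sequence or length limit
  }
  return ids;
}
```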

KV-Cache

Recomputing attention over the entire prefix at every step would be quadratically expensive. In practice, the key-value cache stores the K and V tensors from all previous positions so that each new step only computes attention for the single new token — turning the per-step cost from O(T^2) to O(T).
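
The sketch below shows the caching logic for a single attention head in isolation; the projections that turn tokens into queries, keys, and values are assumed to happen elsewhere, so treat it as a conceptual outline rather than a real implementation.

```typescript
// Conceptual sketch of a key-value cache for a single attention head.
// Each decoding step appends one new key/value and attends the new token's
// query against everything already cached, instead of recomputing the prefix.
interface KVCache {
  keys: number[][];
  values: number[][];
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function attendWithCache(
  query: number[],
  newKey: number[],
  newValue: number[],
  cache: KVCache,
): number[] {
  cache.keys.push(newKey);      // extend the cache by one position
  cache.values.push(newValue);
  const dk = query.length;
  const scores = cache.keys.map(k => dot(query, k) / Math.sqrt(dk)); // O(T) work per step
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const weightSum = exps.reduce((a, b) => a + b, 0);
  const weights = exps.map(e => e / weightSum);
  // The attention output for the new token is a weighted sum of cached values.
  return newValue.map((_, d) =>
    cache.values.reduce((acc, v, t) => acc + weights[t] * v[d], 0),
  );
}
```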

Autoregressive Loop (interactive demo): given the prefix "The quick brown", the model predicts the next token, e.g. "fox" with probability around 85%.
6. Sampling Strategies

How we draw from P(x_t \mid x_{<t}) at inference time dramatically affects the quality and diversity of generated text. The choice is a trade-off between coherence and creativity.

Greedy Decoding

Always select the most probable token:

\hat{x}_t = \arg\max_{v \in \mathcal{V}} P(x_t = v \mid x_{<t})

Simple and deterministic, but prone to repetitive, degenerate outputs since it commits fully to the local maximum at every step.
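
In code, greedy decoding is just an argmax over the logits; a function like the one below could serve as the `sampleFrom` strategy in the generation loop of §5.

```typescript
// Greedy decoding: deterministically pick the index of the largest logit.
function greedy(logits: number[]): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;  // argmax over the vocabulary
  }
  return best;
}
```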


Temperature Scaling

Before the softmax, divide the logits by a temperature parameter \tau > 0:

P_\tau(x_t = v \mid x_{<t}) = \frac{\exp(z_v / \tau)}{\sum_{v'} \exp(z_{v'} / \tau)}

\tau | Effect
\tau \to 0 | Converges to greedy — almost all probability on the top token
\tau = 1 | Default — model's original distribution
\tau > 1 | Flatter distribution — more diverse, more random output
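
In code, temperature scaling is a one-line change before the softmax; the logit values below are purely illustrative.

```typescript
// Temperature scaling: divide the logits by τ before the softmax.
function softmaxWithTemperature(logits: number[], tau: number): number[] {
  const scaled = logits.map(z => z / tau);
  const max = Math.max(...scaled);
  const exps = scaled.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const demoLogits = [5.2, 3.9, 2.8];
console.log(softmaxWithTemperature(demoLogits, 0.5)); // sharper than τ = 1: close to greedy
console.log(softmaxWithTemperature(demoLogits, 2.0)); // flatter than τ = 1: more diverse
```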

Top-k Sampling

Restrict sampling to the k most probable tokens, renormalising over them:

P_k(x_t = v \mid x_{<t}) \propto P(x_t = v \mid x_{<t}) \cdot \mathbf{1}[v \in \text{Top-}k]

This cuts off the long tail of unlikely tokens. Common values: k = 40 to k = 100.
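
A sketch of top-k sampling over an already-computed probability vector; a real implementation would operate on the full vocabulary-sized distribution.

```typescript
// Top-k sampling: keep the k most probable tokens, renormalise, then sample.
function sampleTopK(probs: number[], k: number): number {
  const ranked = probs
    .map((p, i) => ({ p, i }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k);                            // keep only the k most probable tokens
  const mass = ranked.reduce((s, e) => s + e.p, 0);
  let r = Math.random() * mass;              // sampling in [0, mass) renormalises implicitly
  for (const e of ranked) {
    r -= e.p;
    if (r <= 0) return e.i;
  }
  return ranked[ranked.length - 1].i;        // fallback for floating-point edge cases
}
```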


Nucleus (Top-p) Sampling

Rather than a fixed k, keep the smallest set of tokens whose cumulative probability reaches at least p:

\mathcal{V}_p = \text{smallest } S \subseteq \mathcal{V} \text{ s.t. } \sum_{v \in S} P(x_t = v \mid x_{<t}) \geq p

The cutoff adapts dynamically: when the distribution is peaked, only a handful of tokens are included; when it is flat, more tokens enter the nucleus.
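
A corresponding sketch of nucleus sampling, again over an already-computed probability vector:

```typescript
// Nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative
// probability reaches p, then sample from that set.
function sampleTopP(probs: number[], p: number): number {
  const ranked = probs.map((q, i) => ({ q, i })).sort((a, b) => b.q - a.q);
  const nucleus: { q: number; i: number }[] = [];
  let mass = 0;
  for (const e of ranked) {
    nucleus.push(e);
    mass += e.q;
    if (mass >= p) break;                    // smallest set with cumulative mass >= p
  }
  let r = Math.random() * mass;              // sampling in [0, mass) renormalises implicitly
  for (const e of nucleus) {
    r -= e.q;
    if (r <= 0) return e.i;
  }
  return nucleus[nucleus.length - 1].i;
}
```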

In Practice

Most modern APIs combine temperature + top-p: e.g., temperature=0.7, top_p=0.9. Pure greedy decoding is rarely used in production text generation.

Controls (interactive demo): at temperature τ = 1.00, the resulting next-token distribution is roughly fox 54%, dog 27%, bear 13%, cat 4%, wolf 2%, with lion and zebra near 0%.

7. Exposure Bias: Training vs. Inference Gap

A subtle but important mismatch exists between training and generation.

During training (teacher forcing), the model always receives the ground-truth prefix. During inference, it conditions on its own previous predictions. If the model makes an error at step t, that error becomes part of the context for step t+1, and mistakes can compound — a problem called exposure bias.

Phase | Context fed to the model | Problem
Training | Ground-truth x_{<t} | Never sees its own mistakes
Inference | Predicted \hat{x}_{<t} | Small errors compound over long sequences

Approaches to mitigate this include scheduled sampling (gradually replacing ground-truth tokens with predicted ones during training) and RLHF fine-tuning, which allows the model to experience the consequences of its own generations.
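
As a rough sketch of the scheduled-sampling idea (not any particular paper's recipe), the helper below builds a training prefix in which each position is taken either from the ground truth or from a hypothetical `predictNext` stand-in for the model's own prediction, according to a mixing rate:

```typescript
// Scheduled sampling sketch: build the prefix for position t by mixing
// ground-truth tokens with the model's own predictions.
// `predictNext` is a hypothetical stand-in for the model; `mixRate` is the
// probability of using a model prediction instead of the ground truth.
function scheduledSamplingPrefix(
  groundTruth: string[],
  t: number,
  mixRate: number,
  predictNext: (prefix: string[]) => string,
): string[] {
  const prefix: string[] = [];
  for (let i = 0; i < t; i++) {
    const useModelToken = Math.random() < mixRate;
    prefix.push(useModelToken ? predictNext(prefix) : groundTruth[i]);
  }
  return prefix; // used as the context when the model is trained to predict groundTruth[t]
}
```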


8. Why Autoregressive Models Scale

Autoregressive modelling with Transformers has a remarkable property: the training objective requires no labelled data. Any raw text corpus is a valid training set — the supervision signal comes entirely from predicting the next token in the corpus itself. This is self-supervised learning.

Combined with the Transformer's parallelism across sequence positions (enabled by causal masking), this means:

  • Data: Any text on the internet is usable.
  • Compute: Each GPU processes all T positions simultaneously during training.
  • Signal: A single document of T tokens generates T gradient-informative predictions.

The result is a training paradigm that scales predictably with model size, data, and compute — the empirical observation captured by the scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020).


9. Beyond Text: Autoregressive Generation in Other Domains

The autoregressive factorisation is not limited to text. The same principle applies to any domain where data can be serialised into a discrete sequence:

Domain | Model | Tokens
Text | GPT, LLaMA | BPE subwords
Images | PixelCNN, ImageGPT | Pixel values or VQ-VAE codes
Audio | WaveNet | Quantised waveform samples
Code | Codex, DeepSeek-Coder | Code tokens
Proteins | ProtGPT2 | Amino acid residues

In each case, the model learns P(x_t \mid x_{<t}) and generates by sampling token-by-token in the appropriate discrete space.


10. From API Call to Token Stream: The SDK Pipeline

When you call a chat API, the request you send bears no resemblance to what the model actually processes. Your JSON message objects go through three silent transformation steps before a single forward pass occurs.

Step 1 — JSON input. The SDK accepts a structured list of { role, content } objects — a developer-friendly abstraction.

Step 2 — Chat template. The SDK serialises those objects into a single raw string using a model-specific format (e.g. ChatML, Llama-chat, Gemma). Special boundary tokens like <|im_start|> and <|im_end|> mark speaker turns. The model was trained to expect exactly this format.

Step 3 — Tokenisation. The formatted string is split into integer token IDs. These IDs — not strings — are the model's actual input.

Step 4 — Autoregressive generation. The model runs the loop described in §5, sampling one token at a time until it emits a stop token. The SDK reassembles the output IDs back into text for you.
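
As an illustration of steps 1 and 2, the sketch below serialises message objects into a ChatML-style string. The exact boundary tokens and layout differ between model families, so treat this as an example of the idea rather than any specific model's template.

```typescript
// Sketch of steps 1–2: turning { role, content } objects into a ChatML-style
// prompt string. Boundary tokens and layout are model-specific; this is only
// an illustration of the idea.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

function applyChatTemplate(messages: Message[]): string {
  const turns = messages.map(
    m => `<|im_start|>${m.role}\n${m.content}<|im_end|>\n`,
  );
  // Leave an assistant turn open so the model generates the reply from here.
  return turns.join("") + "<|im_start|>assistant\n";
}

console.log(applyChatTemplate([{ role: "user", content: "Say hi" }]));
// <|im_start|>user
// Say hi<|im_end|>
// <|im_start|>assistant
```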

SDK Request Pipeline (interactive demo): 1. API Input (JSON) → 2. Chat template applied → 3. Tokenisation → 4. Model inference.

const messages = [
  {
    "role": "user",
    "content": "Say hi"
  }
];

Developers pass simple JSON objects to the SDK. The model itself doesn't understand JSON objects natively.

This pipeline is invisible in typical API usage, but understanding it clarifies why prompt format matters, why different models require different templates, and why "tokens" are the natural unit of cost and context length.


Summary

Concept | Key Idea
Chain rule factorisation | P(\mathbf{x}) = \prod_t P(x_t \mid x_{<t}) — exact, no assumptions
Next-token prediction | Single self-supervised objective covers all positions
Causal masking | Lower-triangular attention prevents future leakage during training
Autoregressive loop | Token-by-token generation at inference time
Sampling strategies | Temperature, top-k, top-p trade coherence for diversity
Exposure bias | Train on truth, generate from self — errors can compound
Scalability | Self-supervised + parallel training = scales with data and compute
