Autoregressive Models

How modern language models generate sequences one token at a time — from the chain rule of probability to sampling strategies and causal attention.

A language model answers one fundamental question: given everything that came before, what comes next?

Autoregressive models make this concrete by breaking the joint probability of an entire sequence into a chain of conditional probabilities — each token predicted from all previous tokens. This simple factorization, combined with the Transformer architecture, powers GPT, LLaMA, Mistral, and virtually every major modern language model.


1. The Chain Rule Factorization

The probability of a sequence \mathbf{x} = (x_1, x_2, \dots, x_T) can always be decomposed exactly using the chain rule of probability:

P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_1, x_2, \dots, x_{t-1})

This is not an approximation — it is an identity. What makes it useful is that each factor P(x_t \mid x_{<t}) can be modelled by a neural network that takes the prefix x_{<t} as input and outputs a probability distribution over the vocabulary.

Term | Meaning
\mathbf{x} | The full sequence (x_1, \dots, x_T)
x_t | The token at position t
x_{<t} | All tokens before position t (the context)
P(x_t \mid x_{<t}) | The model's next-token distribution

Why This Works

The chain rule requires no independence assumptions. Every token can depend on every preceding token. The model's capacity to capture long-range dependencies comes from the architecture — not from a simplifying assumption.
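
As a concrete illustration, here is a minimal TypeScript sketch that scores a sequence by summing the log of each conditional factor. The toy `nextTokenProbs` function and its three-token vocabulary are invented for this example; in a real model those distributions come from a neural network.

```typescript
// Chain-rule scoring of a sequence over a toy three-token vocabulary.
// `nextTokenProbs` is an invented stand-in for a trained model's
// next-token distribution P(x_t | x_{<t}).
type Token = "the" | "cat" | "sat";

function nextTokenProbs(prefix: Token[]): Record<Token, number> {
  const last = prefix[prefix.length - 1];
  if (last === "the") return { the: 0.1, cat: 0.7, sat: 0.2 };
  if (last === "cat") return { the: 0.1, cat: 0.1, sat: 0.8 };
  return { the: 0.6, cat: 0.2, sat: 0.2 };  // empty prefix, or after "sat"
}

// log P(x) = Σ_t log P(x_t | x_{<t}): the chain rule, applied term by term.
function sequenceLogProb(tokens: Token[]): number {
  let logProb = 0;
  for (let t = 0; t < tokens.length; t++) {
    const dist = nextTokenProbs(tokens.slice(0, t));  // condition on the prefix x_{<t}
    logProb += Math.log(dist[tokens[t]]);             // add log P(x_t | x_{<t})
  }
  return logProb;
}

console.log(sequenceLogProb(["the", "cat", "sat"]));
// = log(0.6) + log(0.7) + log(0.8) ≈ -1.09
```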


2. Vocabulary and Token Distributions

Before diving into how the model is trained, it helps to understand what the model actually outputs.

Modern language models operate over a fixed vocabulary \mathcal{V} — typically 32,000 to 128,000 subword tokens (produced by BPE or SentencePiece tokenisation). At each step, the model outputs a vector of logits \mathbf{z} \in \mathbb{R}^{|\mathcal{V}|}, one per vocabulary entry. A softmax converts these into a valid probability distribution:

P(x_t = v \mid x_{<t}) = \frac{\exp(z_v)}{\sum_{v' \in \mathcal{V}} \exp(z_{v'})}

Logits vs Probabilities (interactive demo): for the prompt "The quick brown", the softmax turns the model's logits into a next-token distribution: fox 69.6%, dog 19.0%, bear 6.3%, rabbit 3.5%, cat 1.7%.
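
The softmax step itself is only a few lines of code. The sketch below uses illustrative logit values, chosen so that the resulting distribution roughly matches the demo above; they are not taken from any real model.

```typescript
// Softmax: logits in, probabilities out. The logit values below are
// illustrative, chosen so the output roughly matches the demo above.
function softmax(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);                   // subtract max for numerical stability
  const exps = logits.map(z => Math.exp(z - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);                          // non-negative and sums to 1
}

const vocab = ["fox", "dog", "bear", "rabbit", "cat"];
const logits = [5.2, 3.9, 2.8, 2.2, 1.5];                 // one raw score per vocabulary entry
const probs = softmax(logits);
vocab.forEach((v, i) => console.log(v, (100 * probs[i]).toFixed(1) + "%"));
// fox 69.6%, dog 19.0%, bear 6.3%, rabbit 3.5%, cat 1.7%
```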

3. Training: Next-Token Prediction

The training objective for an autoregressive model is cross-entropy loss summed over every position in the sequence:

\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})

Minimising this loss is equivalent to maximising the log-likelihood of the training corpus. Crucially, a single sequence provides T training signal pairs — one per position — making the objective extremely data-efficient.

What the Model Sees During Training

At position t, the model receives the prefix (x_1, \dots, x_{t-1}) and must assign high probability to x_t (the ground-truth next token). The gradient of the loss with respect to \theta nudges the model to be more confident about correct continuations.

This is known as teacher forcing: during training, the ground-truth prefix is always fed in, regardless of what the model would have predicted.
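
A minimal sketch of this objective, assuming a hypothetical `modelProbs` function that returns the model's next-token distribution for a given ground-truth prefix:

```typescript
type Dist = Record<string, number>;

// Per-sequence training loss under teacher forcing: at every position the
// ground-truth prefix is fed in and the loss is -log P(true next token).
function crossEntropyLoss(
  tokens: string[],
  modelProbs: (prefix: string[]) => Dist,  // hypothetical stand-in for the model
): number {
  let loss = 0;
  for (let t = 0; t < tokens.length; t++) {
    const prefix = tokens.slice(0, t);          // teacher forcing: always the true prefix
    const p = modelProbs(prefix)[tokens[t]];    // probability assigned to the true next token
    loss += -Math.log(p ?? 1e-12);              // -log P(x_t | x_{<t}); guard against zero
  }
  return loss / tokens.length;                  // average over the T positions
}

// Toy usage: a "model" that is uniform over a four-token vocabulary.
const uniform = (_prefix: string[]): Dist => ({ the: 0.25, cat: 0.25, sat: 0.25, mat: 0.25 });
console.log(crossEntropyLoss(["the", "cat", "sat"], uniform)); // -log(0.25) ≈ 1.386
```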

Teacher Forcing Training (interactive demo): the model reads the ground-truth prefix x_{<t} (e.g. "The") and is forced to predict the ground-truth next token y_t (e.g. "cat"); if it predicts something else, the cross-entropy loss generates a gradient to correct it.

4. Causal Masking: Preventing Future Leakage

Transformers process all positions in parallel during training, which creates a problem: a naive attention mechanism would let token t attend to token t+1, allowing the model to "cheat" by reading the answer before predicting it.

Causal masking (also called a look-ahead mask or autoregressive mask) solves this by setting the attention scores for all positions j > i to -\infty before the softmax, so those positions receive zero attention weight:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right) V

where the mask M is:

M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}

The result is a lower-triangular attention matrix: each token can attend only to itself and all tokens to its left.
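
The mask is easy to construct directly. The sketch below builds the additive mask M and shows how one row of masked scores is normalised; the function names are illustrative rather than taken from any library.

```typescript
// Sketch: building the additive causal mask M used inside the attention softmax.
// M[i][j] = 0 when j <= i (position i may attend to j), -Infinity when j > i.
function causalMask(seqLen: number): number[][] {
  return Array.from({ length: seqLen }, (_, i) =>
    Array.from({ length: seqLen }, (_, j) => (j <= i ? 0 : -Infinity)),
  );
}

// Applying one row of the mask to raw attention scores before the softmax:
// masked future positions get weight exactly 0.
function maskedSoftmaxRow(scores: number[], maskRow: number[]): number[] {
  const masked = scores.map((s, j) => s + maskRow[j]);      // future scores become -Infinity
  const max = Math.max(...masked.filter(Number.isFinite));  // stabilise over visible entries
  const exps = masked.map(s => Math.exp(s - max));          // exp(-Infinity) === 0
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

console.log(causalMask(3));
// [[0, -Inf, -Inf], [0, 0, -Inf], [0, 0, 0]]: each token sees itself and the tokens to its left.
console.log(maskedSoftmaxRow([2.0, 1.0, 3.0], causalMask(3)[1]));
// Only the first two scores compete; the third gets probability 0.
```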

Attention Matrix (Causal Mask) (interactive demo): a 6×6 lower-triangular attention matrix over the tokens "The cat sat on the mat"; every entry above the diagonal is masked to -∞, so, for example, the first token "The" can attend only to itself.

5. Generation: The Autoregressive Loop

At inference time, the model generates a sequence one token at a time. This is the autoregressive loop:

  1. Start with a prompt (the context tokens x_1, \dots, x_k).
  2. Forward pass — feed the current sequence into the model to get logits \mathbf{z} at the last position.
  3. Sample — draw the next token \hat{x}_{k+1} from the distribution P(x \mid x_{\leq k}).
  4. Append \hat{x}_{k+1} to the sequence.
  5. Repeat from step 2 until an end-of-sequence token is sampled or a length limit is reached.

Because each new token depends on all previous ones, generation is inherently sequential — unlike training, it cannot be parallelised across time steps.
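
A minimal sketch of this loop, assuming a hypothetical `model` function that maps the current token ids to logits and a pluggable `sampleFrom` strategy (greedy, temperature, top-k, or top-p, as described in §6):

```typescript
// Sketch of the autoregressive loop: forward pass, sample, append, repeat.
// `model` and `sampleFrom` are hypothetical stand-ins, not a real API.
function generate(
  promptIds: number[],
  model: (ids: number[]) => number[],        // returns one logit per vocabulary entry
  sampleFrom: (logits: number[]) => number,  // greedy argmax, temperature sampling, etc.
  eosId: number,
  maxNewTokens: number,
): number[] {
  const ids = [...promptIds];                // 1. start from the prompt
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = model(ids);               // 2. forward pass over the current sequence
    const nextId = sampleFrom(logits);       // 3. sample the next token
    ids.push(nextId);                        // 4. append it to the sequence
    if (nextId === eosId) break;             // 5. stop at end-of-sequence or length limit
  }
  return ids;
}
```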

KV-Cache

Recomputing attention over the entire prefix at every step would be quadratically expensive. In practice, the key-value cache stores the K and V tensors from all previous positions so that each new step only computes attention for the single new token — turning the per-step cost from O(T^2) to O(T).
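
The sketch below shows the caching logic for a single attention head in isolation; the projections that turn tokens into queries, keys, and values are assumed to happen elsewhere, so treat it as a conceptual outline rather than a real implementation.

```typescript
// Conceptual sketch of a key-value cache for a single attention head.
// Each decoding step appends one new key/value and attends the new token's
// query against everything already cached, instead of recomputing the prefix.
interface KVCache {
  keys: number[][];
  values: number[][];
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function attendWithCache(
  query: number[],
  newKey: number[],
  newValue: number[],
  cache: KVCache,
): number[] {
  cache.keys.push(newKey);      // extend the cache by one position
  cache.values.push(newValue);
  const dk = query.length;
  const scores = cache.keys.map(k => dot(query, k) / Math.sqrt(dk)); // O(T) work per step
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const weightSum = exps.reduce((a, b) => a + b, 0);
  const weights = exps.map(e => e / weightSum);
  // The attention output for the new token is a weighted sum of cached values.
  return newValue.map((_, d) =>
    cache.values.reduce((acc, v, t) => acc + weights[t] * v[d], 0),
  );
}
```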

Autoregressive Loop (interactive demo): given the prefix "The quick brown", the model predicts the next token, e.g. "fox" with probability around 85%.
6. Sampling Strategies

How we draw from P(x_t \mid x_{<t}) at inference time dramatically affects the quality and diversity of generated text. The choice is a trade-off between coherence and creativity.

Greedy Decoding

Always select the most probable token:

\hat{x}_t = \arg\max_{v \in \mathcal{V}} P(x_t = v \mid x_{<t})

Simple and deterministic, but prone to repetitive, degenerate outputs since it commits fully to the local maximum at every step.
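
In code, greedy decoding is just an argmax over the logits; a function like the one below could serve as the `sampleFrom` strategy in the generation loop of §5.

```typescript
// Greedy decoding: deterministically pick the index of the largest logit.
function greedy(logits: number[]): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;  // argmax over the vocabulary
  }
  return best;
}
```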


Temperature Scaling

Before the softmax, divide the logits by a temperature parameter \tau > 0:

P_\tau(x_t = v \mid x_{<t}) = \frac{\exp(z_v / \tau)}{\sum_{v'} \exp(z_{v'} / \tau)}

\tau | Effect
\tau \to 0 | Converges to greedy — almost all probability on the top token
\tau = 1 | Default — model's original distribution
\tau > 1 | Flatter distribution — more diverse, more random output
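
In code, temperature scaling is a one-line change before the softmax; the logit values below are purely illustrative.

```typescript
// Temperature scaling: divide the logits by τ before the softmax.
function softmaxWithTemperature(logits: number[], tau: number): number[] {
  const scaled = logits.map(z => z / tau);
  const max = Math.max(...scaled);
  const exps = scaled.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const demoLogits = [5.2, 3.9, 2.8];
console.log(softmaxWithTemperature(demoLogits, 0.5)); // sharper than τ = 1: close to greedy
console.log(softmaxWithTemperature(demoLogits, 2.0)); // flatter than τ = 1: more diverse
```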

Top-k Sampling

Restrict sampling to the k most probable tokens, renormalising over them:

P_k(x_t = v \mid x_{<t}) \propto P(x_t = v \mid x_{<t}) \cdot \mathbf{1}[v \in \text{Top-}k]

This cuts off the long tail of unlikely tokens. Common values: k = 40 to k = 100.
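
A sketch of top-k sampling over an already-computed probability vector; a real implementation would operate on the full vocabulary-sized distribution.

```typescript
// Top-k sampling: keep the k most probable tokens, renormalise, then sample.
function sampleTopK(probs: number[], k: number): number {
  const ranked = probs
    .map((p, i) => ({ p, i }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k);                            // keep only the k most probable tokens
  const mass = ranked.reduce((s, e) => s + e.p, 0);
  let r = Math.random() * mass;              // sampling in [0, mass) renormalises implicitly
  for (const e of ranked) {
    r -= e.p;
    if (r <= 0) return e.i;
  }
  return ranked[ranked.length - 1].i;        // fallback for floating-point edge cases
}
```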


Nucleus (Top-p) Sampling

Rather than a fixed k, keep the smallest set of tokens whose cumulative probability reaches at least p:

\mathcal{V}_p = \text{smallest } S \subseteq \mathcal{V} \text{ s.t. } \sum_{v \in S} P(x_t = v \mid x_{<t}) \geq p

The cutoff adapts dynamically: when the distribution is peaked, only a handful of tokens are included; when it is flat, more tokens enter the nucleus.
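
A corresponding sketch of nucleus sampling, again over an already-computed probability vector:

```typescript
// Nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative
// probability reaches p, then sample from that set.
function sampleTopP(probs: number[], p: number): number {
  const ranked = probs.map((q, i) => ({ q, i })).sort((a, b) => b.q - a.q);
  const nucleus: { q: number; i: number }[] = [];
  let mass = 0;
  for (const e of ranked) {
    nucleus.push(e);
    mass += e.q;
    if (mass >= p) break;                    // smallest set with cumulative mass >= p
  }
  let r = Math.random() * mass;              // sampling in [0, mass) renormalises implicitly
  for (const e of nucleus) {
    r -= e.q;
    if (r <= 0) return e.i;
  }
  return nucleus[nucleus.length - 1].i;
}
```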

In Practice

Most modern APIs combine temperature + top-p: e.g., temperature=0.7, top_p=0.9. Pure greedy decoding is rarely used in production text generation.

Controls (interactive demo): at temperature τ = 1.00, the resulting next-token distribution is roughly fox 54%, dog 27%, bear 13%, cat 4%, wolf 2%, with lion and zebra near 0%.

7. Exposure Bias: Training vs. Inference Gap

A subtle but important mismatch exists between training and generation.

During training (teacher forcing), the model always receives the ground-truth prefix. During inference, it conditions on its own previous predictions. If the model makes an error at step t, that error becomes part of the context for step t+1, and mistakes can compound — a problem called exposure bias.

Phase | Context fed to the model | Problem
Training | Ground-truth x_{<t} | Never sees its own mistakes
Inference | Predicted \hat{x}_{<t} | Small errors compound over long sequences

Approaches to mitigate this include scheduled sampling (gradually replacing ground-truth tokens with predicted ones during training) and RLHF fine-tuning, which allows the model to experience the consequences of its own generations.
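
As a rough sketch of the scheduled-sampling idea (not any particular paper's recipe), the helper below builds a training prefix in which each position is taken either from the ground truth or from a hypothetical `predictNext` stand-in for the model's own prediction, according to a mixing rate:

```typescript
// Scheduled sampling sketch: build the prefix for position t by mixing
// ground-truth tokens with the model's own predictions.
// `predictNext` is a hypothetical stand-in for the model; `mixRate` is the
// probability of using a model prediction instead of the ground truth.
function scheduledSamplingPrefix(
  groundTruth: string[],
  t: number,
  mixRate: number,
  predictNext: (prefix: string[]) => string,
): string[] {
  const prefix: string[] = [];
  for (let i = 0; i < t; i++) {
    const useModelToken = Math.random() < mixRate;
    prefix.push(useModelToken ? predictNext(prefix) : groundTruth[i]);
  }
  return prefix; // used as the context when the model is trained to predict groundTruth[t]
}
```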


8. Why Autoregressive Models Scale

Autoregressive modelling with Transformers has a remarkable property: the training objective requires no labelled data. Any raw text corpus is a valid training set — the supervision signal comes entirely from predicting the next token in the corpus itself. This is self-supervised learning.

Combined with the Transformer's parallelism across sequence positions (enabled by causal masking), this means:

  • Data: Any text on the internet is usable.
  • Compute: Each GPU processes all T positions simultaneously during training.
  • Signal: A single document of T tokens generates T gradient-informative predictions.

The result is a training paradigm that scales predictably with model size, data, and compute — the empirical observation captured by the scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020).


9. Beyond Text: Autoregressive Generation in Other Domains

The autoregressive factorisation is not limited to text. The same principle applies to any domain where data can be serialised into a discrete sequence:

Domain | Model | Tokens
Text | GPT, LLaMA | BPE subwords
Images | PixelCNN, ImageGPT | Pixel values or VQ-VAE codes
Audio | WaveNet | Quantised waveform samples
Code | Codex, DeepSeek-Coder | Code tokens
Proteins | ProtGPT2 | Amino acid residues

In each case, the model learns P(x_t \mid x_{<t}) and generates by sampling token-by-token in the appropriate discrete space.


10. From API Call to Token Stream: The SDK Pipeline

When you call a chat API, the request you send bears no resemblance to what the model actually processes. Your JSON message objects go through three silent transformation steps before a single forward pass occurs.

Step 1 — JSON input. The SDK accepts a structured list of { role, content } objects — a developer-friendly abstraction.

Step 2 — Chat template. The SDK serialises those objects into a single raw string using a model-specific format (e.g. ChatML, Llama-chat, Gemma). Special boundary tokens like <|im_start|> and <|im_end|> mark speaker turns. The model was trained to expect exactly this format.

Step 3 — Tokenisation. The formatted string is split into integer token IDs. These IDs — not strings — are the model's actual input.

Step 4 — Autoregressive generation. The model runs the loop described in §5, sampling one token at a time until it emits a stop token. The SDK reassembles the output IDs back into text for you.
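
As an illustration of steps 1 and 2, the sketch below serialises message objects into a ChatML-style string. The exact boundary tokens and layout differ between model families, so treat this as an example of the idea rather than any specific model's template.

```typescript
// Sketch of steps 1–2: turning { role, content } objects into a ChatML-style
// prompt string. Boundary tokens and layout are model-specific; this is only
// an illustration of the idea.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

function applyChatTemplate(messages: Message[]): string {
  const turns = messages.map(
    m => `<|im_start|>${m.role}\n${m.content}<|im_end|>\n`,
  );
  // Leave an assistant turn open so the model generates the reply from here.
  return turns.join("") + "<|im_start|>assistant\n";
}

console.log(applyChatTemplate([{ role: "user", content: "Say hi" }]));
// <|im_start|>user
// Say hi<|im_end|>
// <|im_start|>assistant
```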

SDK Request Pipeline (interactive demo): 1. API Input (JSON) → 2. Chat template applied → 3. Tokenisation → 4. Model inference.

const messages = [
  {
    "role": "user",
    "content": "Say hi"
  }
];

Developers pass simple JSON objects to the SDK. The model itself doesn't understand JSON objects natively.

This pipeline is invisible in typical API usage, but understanding it clarifies why prompt format matters, why different models require different templates, and why "tokens" are the natural unit of cost and context length.


Summary

Concept | Key Idea
Chain rule factorisation | P(\mathbf{x}) = \prod_t P(x_t \mid x_{<t}) — exact, no assumptions
Next-token prediction | Single self-supervised objective covers all positions
Causal masking | Lower-triangular attention prevents future leakage during training
Autoregressive loop | Token-by-token generation at inference time
Sampling strategies | Temperature, top-k, top-p trade coherence for diversity
Exposure bias | Train on truth, generate from self — errors can compound
Scalability | Self-supervised + parallel training = scales with data and compute
