Autoregressive Models
How modern language models generate sequences one token at a time — from the chain rule of probability to sampling strategies and causal attention.
A language model answers one fundamental question: given everything that came before, what comes next?
Autoregressive models make this concrete by breaking the joint probability of an entire sequence into a chain of conditional probabilities — each token predicted from all previous tokens. This simple factorization, combined with the Transformer architecture, powers GPT, LLaMA, Mistral, and virtually every major modern language model.
1. The Chain Rule Factorization
The probability of a sequence can always be decomposed exactly using the chain rule of probability:

$$p(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$$
This is not an approximation — it is an identity. What makes it useful is that each factor can be modelled by a neural network that takes the prefix as input and outputs a probability distribution over the vocabulary.
| Term | Meaning |
|---|---|
| $x = (x_1, \dots, x_T)$ | The full sequence |
| $x_t$ | The token at position $t$ |
| $x_{<t} = (x_1, \dots, x_{t-1})$ | All tokens before position $t$ (the context) |
| $p_\theta(x_t \mid x_{<t})$ | The model's next-token distribution |
Why This Works
The chain rule requires no independence assumptions. Every token can depend on every preceding token. The model's capacity to capture long-range dependencies comes from the architecture — not from a simplifying assumption.
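As a quick numeric illustration in Python (the conditional probabilities below are made up), the joint probability of a three-token sequence is just the product of its per-step conditionals, or equivalently the sum of their logs:

```python
import numpy as np

# Made-up per-step conditionals p(x_t | x_<t) for one specific three-token sequence.
conditionals = [0.5, 0.2, 0.1]

joint = np.prod(conditionals)              # chain rule: product of the conditionals
log_joint = np.sum(np.log(conditionals))   # equivalently, a sum of log-probabilities

print(joint)               # ≈ 0.01
print(np.exp(log_joint))   # same value, computed in log space for numerical stability
```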
2. Vocabulary and Token Distributions
Before diving into how the model is trained, it helps to understand what the model actually outputs.
Modern language models operate over a fixed vocabulary $V$ — typically 32 000 to 128 000 subword tokens (produced by BPE or SentencePiece tokenisation). At each step, the model outputs a vector of logits $z \in \mathbb{R}^{|V|}$, one per vocabulary entry. A softmax converts these into a valid probability distribution:

$$p_\theta(x_t = i \mid x_{<t}) = \frac{\exp(z_i)}{\sum_{j=1}^{|V|} \exp(z_j)}$$
Logits vs Probabilities
Logits are raw, unnormalised real-valued scores; only after the softmax do they become probabilities that are non-negative and sum to one.
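A minimal sketch of the logits-to-probabilities step in plain NumPy (the logit values are arbitrary):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0, 0.5])   # one unnormalised score per vocabulary entry
probs = softmax(logits)
print(probs, probs.sum())                   # non-negative entries that sum to 1
```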
3. Training: Next-Token Prediction
The training objective for an autoregressive model is cross-entropy loss summed over every position in the sequence:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
Minimising this loss is equivalent to maximising the log-likelihood of the training corpus. Crucially, a single sequence of length $T$ provides $T$ (prefix, next-token) training pairs — one per position — making the objective extremely data-efficient.
What the Model Sees During Training
At position $t$, the model receives the prefix $x_{<t}$ and must assign high probability to $x_t$ (the ground-truth next token). The gradient of the loss with respect to $\theta$ nudges the model to be more confident about correct continuations.
This is known as teacher forcing: during training, the ground-truth prefix is always fed in, regardless of what the model would have predicted.
[Figure: teacher forcing. At every position the model input is the ground-truth prefix $x_{<t}$ and the target is the ground-truth next token $x_t$.]
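The loss above can be sketched directly from per-position logits. Below is a minimal NumPy version; shapes and names are illustrative, and real implementations shift the inputs so that the prediction at position $t-1$ is scored against token $x_t$:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Cross-entropy summed over positions.
    logits:  (T, V) array, vocabulary scores at each position.
    targets: (T,) array, the ground-truth token IDs x_t (teacher forcing)."""
    z = logits - logits.max(axis=-1, keepdims=True)                 # stabilise
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))   # log-softmax
    return -log_probs[np.arange(len(targets)), targets].sum()       # -sum_t log p(x_t | x_<t)

# Toy usage: 4 positions over a 10-token vocabulary.
rng = np.random.default_rng(0)
loss = next_token_loss(rng.standard_normal((4, 10)), np.array([3, 1, 7, 0]))
```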
4. Causal Masking: Preventing Future Leakage
Transformers process all positions in parallel during training, which creates a problem: a naive attention mechanism would let token $i$ attend to any later token $j > i$, allowing the model to "cheat" by reading the answer before predicting it.
Causal masking (also called a look-ahead mask or autoregressive mask) solves this by setting the attention scores for all positions $j > i$ to $-\infty$ before the softmax, effectively zeroing them out:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

where the mask is:

$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
The result is a lower-triangular attention matrix: each token can attend only to itself and all tokens to its left.
[Figure: the causal attention mask, a lower-triangular pattern of allowed attention weights.]
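A small NumPy sketch of the masking step: scores for future positions are pushed to $-\infty$ so the softmax assigns them zero weight (the score matrix here is random, standing in for $QK^\top/\sqrt{d_k}$):

```python
import numpy as np

T = 5
scores = np.random.randn(T, T)                     # stand-in for Q K^T / sqrt(d_k)

mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal (future)
scores = np.where(mask, -np.inf, scores)           # future positions get -inf before softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # each row is now lower-triangular
```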
5. Generation: The Autoregressive Loop
At inference time, the model generates a sequence one token at a time. This is the autoregressive loop:
1. Start with a prompt (the context tokens $x_1, \dots, x_k$).
2. Forward pass — feed the current sequence into the model to get logits at the last position.
3. Sample — draw the next token $x_{t+1}$ from the distribution $p_\theta(\cdot \mid x_{\le t})$.
4. Append $x_{t+1}$ to the sequence.
5. Repeat from step 2 until an end-of-sequence token is sampled or a length limit is reached.
Because each new token depends on all previous ones, generation is inherently sequential — unlike training, it cannot be parallelised across time steps.
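A bare-bones version of this loop, assuming a hypothetical `model` callable that maps a token-ID prefix to next-token logits (pure sampling, no KV-cache):

```python
import numpy as np

def generate(model, prompt_ids, eos_id, max_new_tokens=50):
    """`model` is a hypothetical callable: list of token IDs -> logits over the vocabulary."""
    rng = np.random.default_rng()
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                              # forward pass over the current prefix
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                             # softmax -> next-token distribution
        next_id = int(rng.choice(len(probs), p=probs))   # sample x_{t+1}
        ids.append(next_id)                              # append and condition on it next step
        if next_id == eos_id:                            # stop at end-of-sequence
            break
    return ids
```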
KV-Cache
Recomputing attention over the entire prefix at every step would be quadratically expensive. In practice, the key-value cache stores the $K$ and $V$ tensors from all previous positions so that each new step only computes attention for the single new token — turning the per-step cost from $O(t^2)$ to $O(t)$.
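A single-head sketch of the idea, assuming the per-token query, key, and value vectors are already computed: the cache grows by one row per step, and each step attends only with the newest query.

```python
import numpy as np

def attend(q, K, V):
    """One query against all cached keys/values (single head, no projections)."""
    scores = K @ q / np.sqrt(len(q))            # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                # (d,)

d, rng = 8, np.random.default_rng(0)
K_cache = np.zeros((0, d))                      # keys from all previous positions
V_cache = np.zeros((0, d))                      # values from all previous positions

for step in range(5):
    # In a real model, q, k, v come from projecting the new token's hidden state.
    q, k, v = rng.standard_normal((3, d))
    K_cache = np.vstack([K_cache, k])           # append instead of recomputing the prefix
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)           # O(t) work for the single new token
```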
6. Sampling Strategies
How we draw from $p_\theta(x_t \mid x_{<t})$ at inference time dramatically affects the quality and diversity of generated text. The choice is a trade-off between coherence and creativity.
Greedy Decoding
Always select the most probable token:

$$x_t = \underset{i \in V}{\arg\max}\; p_\theta(x_t = i \mid x_{<t})$$
Simple and deterministic, but prone to repetitive, degenerate outputs since it commits fully to the local maximum at every step.
Temperature Scaling
Before the softmax, divide the logits by a temperature parameter $T$:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
| Temperature $T$ | Effect |
|---|---|
| $T \to 0$ | Converges to greedy — almost all probability on the top token |
| $T = 1$ | Default — the model's original distribution |
| $T > 1$ | Flatter distribution — more diverse, more random output |
Top-$k$ Sampling
Restrict sampling to the $k$ most probable tokens, renormalising over them:

$$p'_i = \begin{cases} \dfrac{p_i}{\sum_{j \in V_k} p_j} & \text{if } i \in V_k \\ 0 & \text{otherwise} \end{cases}$$

where $V_k$ is the set of the $k$ highest-probability tokens.
This cuts off the long tail of unlikely tokens; typical choices put $k$ somewhere in the tens to low hundreds.
Nucleus (Top-$p$) Sampling
Rather than a fixed $k$, keep the smallest set of tokens $V_p$ (taking tokens in order of decreasing probability) whose cumulative probability exceeds $p$:

$$\sum_{i \in V_p} p_\theta(x_t = i \mid x_{<t}) \ge p$$

then renormalise and sample within $V_p$.
The cutoff adapts dynamically: when the distribution is peaked, only a handful of tokens are included; when it is flat, more tokens enter the nucleus.
In Practice
Most modern APIs combine temperature + top-p: e.g., temperature=0.7, top_p=0.9. Pure greedy decoding is rarely used in production text generation.
[Figure: the resulting next-token probability distribution under the chosen sampling settings.]
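The three strategies compose naturally on top of the logits. Below is an illustrative NumPy implementation; the parameter names mirror common API arguments, but the function itself is only a sketch:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Illustrative combination of temperature scaling, top-k and nucleus filtering."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)  # temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()                                               # softmax

    if top_k is not None:                       # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:                       # smallest set with cumulative mass >= p
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, top_p) + 1]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered

    probs /= probs.sum()                        # renormalise over the surviving tokens
    return int(rng.choice(len(probs), p=probs))
```

Calling `sample_next(logits, temperature=0.7, top_p=0.9)` mirrors the temperature plus top-p combination mentioned above.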
7. Exposure Bias: Training vs. Inference Gap
A subtle but important mismatch exists between training and generation.
During training (teacher forcing), the model always receives the ground-truth prefix. During inference, it conditions on its own previous predictions. If the model makes an error at step $t$, that error becomes part of the context for step $t+1$, and mistakes can compound — a problem called exposure bias.
| Phase | Context fed to the model | Problem |
|---|---|---|
| Training | Ground-truth $x_{<t}$ | Never sees its own mistakes |
| Inference | Predicted $\hat{x}_{<t}$ | Small errors compound over long sequences |
Approaches to mitigate this include scheduled sampling (gradually replacing ground-truth tokens with predicted ones during training) and RLHF fine-tuning, which allows the model to experience the consequences of its own generations.
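In its simplest form, scheduled sampling amounts to a per-position coin flip between the ground-truth token and the model's own prediction. A minimal sketch (in practice `epsilon` is annealed upwards over the course of training):

```python
import numpy as np

def scheduled_inputs(ground_truth, model_predictions, epsilon, rng=None):
    """Per-position coin flip: feed the ground-truth token with probability 1 - epsilon,
    otherwise feed the model's own previous prediction."""
    rng = rng or np.random.default_rng()
    use_model = rng.random(len(ground_truth)) < epsilon
    return np.where(use_model, model_predictions, ground_truth)
```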
8. Why Autoregressive Models Scale
Autoregressive modelling with Transformers has a remarkable property: the training objective requires no labelled data. Any raw text corpus is a valid training set — the supervision signal comes entirely from predicting the next token in the corpus itself. This is self-supervised learning.
Combined with the Transformer's parallelism across sequence positions (enabled by causal masking), this means:
- Data: Any text on the internet is usable.
- Compute: Each GPU processes all positions simultaneously during training.
- Signal: A single document of $T$ tokens generates $T$ gradient-informative predictions.
The result is a training paradigm that scales predictably with model size, data, and compute — the empirical observation captured by the scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020).
9. Beyond Text: Autoregressive Generation in Other Domains
The autoregressive factorisation is not limited to text. The same principle applies to any domain where data can be serialised into a discrete sequence:
| Domain | Model | Tokens |
|---|---|---|
| Text | GPT, LLaMA | BPE subwords |
| Images | PixelCNN, ImageGPT | Pixel values or VQ-VAE codes |
| Audio | WaveNet | Quantised waveform samples |
| Code | Codex, DeepSeek-Coder | Code tokens |
| Proteins | ProtGPT2 | Amino acid residues |
In each case, the model learns and generates by sampling token-by-token in the appropriate discrete space.
10. From API Call to Token Stream: The SDK Pipeline
When you call a chat API, the request you send bears no resemblance to what the model actually processes. Your JSON message objects go through three silent transformation steps before a single forward pass occurs.
Step 1 — JSON input. The SDK accepts a structured list of { role, content } objects — a developer-friendly abstraction.
Step 2 — Chat template. The SDK serialises those objects into a single raw string using a model-specific format (e.g. ChatML, Llama-chat, Gemma). Special boundary tokens like <|im_start|> and <|im_end|> mark speaker turns. The model was trained to expect exactly this format.
Step 3 — Tokenisation. The formatted string is split into integer token IDs. These IDs — not strings — are the model's actual input.
Step 4 — Autoregressive generation. The model runs the loop described in §5, sampling one token at a time until it emits a stop token. The SDK reassembles the output IDs back into text for you.
[Figure: SDK request pipeline. The developer-facing input, e.g. `messages = [{ "role": "user", "content": "Say hi" }];`, flows through the chat template and the tokeniser into the autoregressive loop.]
This pipeline is invisible in typical API usage, but understanding it clarifies why prompt format matters, why different models require different templates, and why "tokens" are the natural unit of cost and context length.
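Steps 2 and 3 can be made concrete with the Hugging Face `transformers` tokeniser API. This is a sketch only: the model identifier is a placeholder for any chat-tuned model that ships a chat template.

```python
from transformers import AutoTokenizer

# Placeholder model identifier; substitute any chat-tuned model with a bundled chat template.
tokenizer = AutoTokenizer.from_pretrained("some-org/chat-model")

messages = [{"role": "user", "content": "Say hi"}]

# Step 2: serialise the message objects into the model-specific chat format (e.g. ChatML).
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Step 3: split the formatted string into integer token IDs, the model's actual input.
input_ids = tokenizer(prompt)["input_ids"]

print(prompt)      # raw templated string with special boundary tokens
print(input_ids)   # the list of integers the model consumes
```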
Summary
| Concept | Key Idea |
|---|---|
| Chain rule factorisation | $p(x_{1:T}) = \prod_t p(x_t \mid x_{<t})$ — exact, no assumptions |
| Next-token prediction | Single self-supervised objective covers all positions |
| Causal masking | Lower-triangular attention prevents future leakage during training |
| Autoregressive loop | Token-by-token generation at inference time |
| Sampling strategies | Temperature, top-$k$, top-$p$ trade coherence for diversity |
| Exposure bias | Train on truth, generate from self — errors can compound |
| Scalability | Self-supervised + parallel training = scales with data and compute |