Tokenization — How Language Models Read Text
A deep dive into subword tokenization, Byte-Pair Encoding, and vocabulary lookup — the first stage of every language model.
What is Tokenization?
Language models cannot read raw text. Before any computation happens, text must be converted into a sequence of integers — a format tensors can hold and GPUs can process.
Tokenization is that conversion. A tokenizer:
- Splits raw text into discrete units called tokens
- Maps each token to an integer ID from a fixed vocabulary
- Produces a tensor of IDs that the model embeds and processes
A token is not always a full word. Most modern tokenizers operate at the subword level — splitting words into meaningful fragments based on corpus frequency.
The sentence "transformers are powerful" split into subword tokens, each assigned a vocabulary ID.
Why Subword Tokenization?
Three levels of granularity are possible; the first two have a critical flaw if used alone:
| Strategy | Example | Problem |
|---|---|---|
| Character-level | t, r, a, i, n | Sequences become very long; hard to learn meaning |
| Word-level | train, training → separate | Vocabulary explodes; rare words are [UNK] |
| Subword-level | train, ##ing | Balances length and coverage ✓ |
Subword tokenization is a compression algorithm applied to a training corpus. Frequent sequences stay whole; infrequent ones are decomposed into smaller parts that the model has seen before.
Key Insight
A well-designed vocabulary handles unknown words gracefully — they decompose into known subwords rather than collapsing to a single [UNK] token that loses all information.
Byte-Pair Encoding (BPE)
BPE is the tokenization algorithm behind GPT-2, GPT-3, LLaMA, and most modern LLMs. It was originally a data compression technique adapted for NLP by Sennrich et al. (2016).
The Algorithm
1. Initialize — split every word in the corpus into individual characters, adding a special end-of-word marker if needed, and build a frequency table.
2. Count pairs — scan all adjacent symbol pairs across the corpus and count how often each pair occurs.
3. Merge the most frequent pair — replace every occurrence of the most common pair with a new merged symbol, and record the pair in the merge table.
4. Repeat steps 2–3 until the vocabulary reaches the target size (e.g. 50,000 for GPT-2).
The result is a merge table — an ordered list of pair merges. At inference time, the same merges are applied in the same order to tokenize new text.
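The training loop is compact enough to sketch in Python. The following is a minimal illustration on the classic toy corpus from Sennrich et al., not a production implementation (real tokenizers add pre-tokenization, byte-level fallback, and explicit tie-breaking rules):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    a, b = pair
    new_corpus = {}
    for word, freq in corpus.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[" ".join(out)] = freq
    return new_corpus

# Toy corpus: each word pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(10):  # the target vocabulary size controls the number of merges
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ...]
```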
BPE merge steps on a small corpus. Highlighted tokens are the pair being merged in each step.
Mathematical View
Let $V_0$ be the initial character vocabulary. After $k$ merge operations we have a vocabulary $V_k$ with $|V_k| = |V_0| + k$, since each merge adds exactly one new symbol. Each merge operation fuses the most frequent adjacent pair:

$$(a, b) \;\to\; ab, \qquad V_{k+1} = V_k \cup \{ab\}$$
The final tokenization of a string is determined by applying the learned merge rules greedily in priority order.
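Tokenizing new text thus amounts to replaying the merge table. A minimal sketch, reusing the `merges` list produced by the training sketch above:

```python
def bpe_encode(word, merges):
    """Tokenize one word by replaying the learned merges in priority order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:  # earlier merges have higher priority
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe_encode("lowest", merges))  # ['low', 'est</w>'] with the toy merges above
```

Note that "lowest" never appears in the toy corpus, yet it tokenizes into two meaningful known pieces rather than an unknown token.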
Vocabulary Lookup
Once the text is split into tokens, each token is mapped to an integer ID via a vocabulary table — a simple hash map.
Special tokens are reserved at fixed IDs, conventionally at the start of the vocabulary (the exact IDs and names vary between models; a common layout):
| Token | ID | Purpose |
|---|---|---|
| [PAD] | 0 | Padding to uniform length |
| [BOS] / <s> | 1 | Beginning of sequence |
| [EOS] / </s> | 2 | End of sequence |
| [UNK] | 3 | Unknown / out-of-vocabulary |
| [MASK] | 4 | Masked token (BERT-style) |
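In code, the lookup stage is nothing more than a dictionary with a fallback. A minimal sketch (the subword strings and IDs are illustrative, following the conventions in the table above):

```python
# Illustrative vocabulary: special tokens at fixed low IDs, subwords after them.
vocab = {"[PAD]": 0, "[BOS]": 1, "[EOS]": 2, "[UNK]": 3, "[MASK]": 4,
         "transform": 5, "##ers": 6, "are": 7, "power": 8, "##ful": 9}

def tokens_to_ids(tokens, vocab):
    """Map each token string to its integer ID, falling back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

ids = tokens_to_ids(["[BOS]", "transform", "##ers", "are", "power", "##ful", "[EOS]"], vocab)
print(ids)  # [1, 5, 6, 7, 8, 9, 2]
```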
Each token is looked up in the vocabulary table to produce its integer ID, which is then passed to the embedding layer.
From IDs to Embeddings
The token ID sequence is still just a list of integers — no semantic content. The embedding layer converts each ID $t_i$ into a dense vector by row lookup:

$$x_i = E[t_i], \qquad E \in \mathbb{R}^{|V| \times d}$$

- $|V|$: vocabulary size (e.g. 50,257 for GPT-2)
- $d$: embedding dimension (e.g. 768 for GPT-2 base)
- $E[t_i]$: the $t_i$-th row of the embedding matrix
The embedding matrix is a learned parameter — its rows are updated during training via backpropagation.
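A minimal sketch of the lookup-then-embed step in PyTorch, using the GPT-2 base figures quoted above and the illustrative token IDs from the vocabulary sketch earlier:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_257, 768  # GPT-2 base figures from above
embedding = nn.Embedding(vocab_size, d_model)  # learned matrix E, shape (|V|, d)

token_ids = torch.tensor([1, 5, 6, 7, 8, 9, 2])  # output of the vocabulary lookup
vectors = embedding(token_ids)  # each ID selects one row of E
print(vectors.shape)  # torch.Size([7, 768])
```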
Other Tokenization Algorithms
BPE is the most common, but two other algorithms are widely used:
WordPiece (BERT, DistilBERT)
WordPiece also iteratively merges pairs, but the merge criterion is different. Instead of raw frequency, it picks the pair whose merge most increases the likelihood of the training data under a unigram language model, which reduces to the score

$$\text{score}(a, b) = \frac{\text{count}(ab)}{\text{count}(a) \cdot \text{count}(b)}$$
Tokens not at the start of a word are prefixed with ## (e.g. ##ing).
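The effect of the normalisation is easiest to see numerically. A minimal sketch with invented counts: a pair of individually common parts scores low even when frequent, while parts that almost always co-occur score high.

```python
def wordpiece_score(count_ab, count_a, count_b):
    """Likelihood-based merge criterion: pair frequency normalised by
    the frequencies of its parts."""
    return count_ab / (count_a * count_b)

# Common parts, frequent pair: scores low, so it is merged late (or never).
print(wordpiece_score(count_ab=500, count_a=10_000, count_b=8_000))  # 6.25e-06
# Rare parts that nearly always appear together: scores high, merged early.
print(wordpiece_score(count_ab=400, count_a=450, count_b=420))       # ~2.1e-03
```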
Unigram Language Model (SentencePiece / T5 / ALBERT)
Unigram starts with a large candidate vocabulary and prunes it. Each token's probability is modelled independently, so a segmentation $x = (x_1, \dots, x_n)$ has probability

$$P(x) = \prod_{i=1}^{n} p(x_i)$$
Tokens whose removal least reduces the total corpus likelihood are pruned iteratively until the target vocabulary size is reached.
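The independence assumption makes segmentations directly comparable. A minimal sketch with invented token probabilities, comparing two candidate segmentations of the same string (in log space, as real implementations do):

```python
import math

# Illustrative unigram probabilities for candidate tokens.
probs = {"token": 0.01, "tok": 0.004, "en": 0.03}

def segmentation_log_prob(tokens, probs):
    """Log-probability of a segmentation under the independence assumption."""
    return sum(math.log(probs[t]) for t in tokens)

print(segmentation_log_prob(["token"], probs))      # -4.61  (preferred)
print(segmentation_log_prob(["tok", "en"], probs))  # -9.03
```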
| Algorithm | Direction | Criterion | Used in |
|---|---|---|---|
| BPE | Bottom-up (merge) | Frequency | GPT-2, RoBERTa, LLaMA, Falcon |
| WordPiece | Bottom-up (merge) | Likelihood ratio | BERT, DistilBERT |
| Unigram LM | Top-down (prune) | Marginal likelihood | T5, ALBERT, mBART |
Interactive Tokenizer
Try typing your own text below. The tokenizer applies a simplified BPE-style vocabulary to split your input into subword tokens and assign IDs.
A simplified subword tokenizer. Notice how compound words decompose into known subword pieces, and the token-to-word ratio changes with vocabulary coverage.
Real tokenizers are larger
Production tokenizers (GPT-2: 50,257 tokens; LLaMA-3: 128,256 tokens) have far richer vocabularies. The interactive demo above uses a small illustrative vocabulary — most words will decompose into character-level pieces.
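To inspect a production vocabulary directly, you can load one yourself. A quick sketch using the Hugging Face transformers library (the exact token split in the last comment may vary slightly across tokenizer versions):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.vocab_size)                  # 50257
ids = tok.encode("transformers are powerful")
print(ids)                             # integer IDs under GPT-2's BPE vocabulary
print(tok.convert_ids_to_tokens(ids))  # e.g. ['transform', 'ers', 'Ġare', 'Ġpowerful']
                                       # ('Ġ' marks a token that begins with a space)
```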
Summary
| Concept | What it does |
|---|---|
| Tokenizer | Splits text into discrete subword units |
| BPE | Learns merges from corpus frequency; builds vocabulary bottom-up |
| Vocabulary table | Maps token string → integer ID |
| Embedding layer | Maps integer ID → dense learned vector |
Tokenization is the first and most constrained stage of any language model pipeline. The vocabulary is fixed after training — anything outside it must be decomposed. Understanding this boundary is essential for debugging unexpected model behaviour on rare words, numbers, code, and multilingual text.