Tokenization — How Language Models Read Text
A deep dive into subword tokenization, Byte-Pair Encoding, and vocabulary lookup — the first stage of every language model.
What is Tokenization?
Language models cannot read raw text. Before any computation happens, text must be converted into a sequence of integers — a format tensors can hold and GPUs can process.
Tokenization is that conversion. A tokenizer:
- Splits raw text into discrete units called tokens
- Maps each token to an integer ID from a fixed vocabulary
- Produces a tensor of IDs that the model embeds and processes
A token is not always a full word. Most modern tokenizers operate at the subword level — splitting words into meaningful fragments based on corpus frequency.
The sentence "transformers are powerful" split into subword tokens, each assigned a vocabulary ID.
Why Subword Tokenization?
Three levels of granularity are possible; the first two have a critical flaw if used alone:
| Strategy | Example | Problem |
|---|---|---|
| Character-level | t, r, a, i, n | Sequences become very long; hard to learn meaning |
| Word-level | train, training → separate | Vocabulary explodes; rare words are [UNK] |
| Subword-level | train, ##ing | Balances length and coverage ✓ |
Subword tokenization is a compression algorithm applied to a training corpus. Frequent sequences stay whole; infrequent ones are decomposed into smaller parts that the model has seen before.
Key Insight
A well-designed vocabulary handles unknown words gracefully — they decompose into known subwords rather than collapsing to a single [UNK] token that loses all information.
Byte-Pair Encoding (BPE)
BPE is the tokenization algorithm behind GPT-2, GPT-3, LLaMA, and most modern LLMs. It was originally a data compression technique adapted for NLP by Sennrich et al. (2016).
The Algorithm
1. Initialize — split every word in the corpus into individual characters, adding a special end-of-word marker if needed, and build a frequency table.
2. Count pairs — scan all adjacent symbol pairs across the corpus and count how often each pair occurs.
3. Merge the most frequent pair — replace every occurrence of the most common pair with a new merged symbol, and record the pair in the merge table.
4. Repeat steps 2–3 until the vocabulary reaches the target size (e.g. 50,000 for GPT-2).
The result is a merge table — an ordered list of pair merges. At inference time, the same merges are applied in the same order to tokenize new text.
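The training loop is compact enough to sketch in Python. The following is a minimal illustration on the classic toy corpus from Sennrich et al., not a production implementation (real tokenizers add pre-tokenization, byte-level fallback, and explicit tie-breaking rules):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    a, b = pair
    new_corpus = {}
    for word, freq in corpus.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus[" ".join(out)] = freq
    return new_corpus

# Toy corpus: each word pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(10):  # the target vocabulary size controls the number of merges
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ...]
```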
BPE merge steps on a small corpus. Highlighted tokens are the pair being merged in each step.
Mathematical View
Let $V_0$ be the initial character vocabulary. After $k$ merge operations we have a vocabulary $V_k$ with $|V_k| = |V_0| + k$, since each merge adds exactly one new symbol. Each merge operation fuses the most frequent adjacent pair:

$$(a, b) \;\to\; ab, \qquad V_{k+1} = V_k \cup \{ab\}$$
The final tokenization of a string is determined by applying the learned merge rules greedily in priority order.
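Tokenizing new text thus amounts to replaying the merge table. A minimal sketch, reusing the `merges` list produced by the training sketch above:

```python
def bpe_encode(word, merges):
    """Tokenize one word by replaying the learned merges in priority order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:  # earlier merges have higher priority
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe_encode("lowest", merges))  # ['low', 'est</w>'] with the toy merges above
```

Note that "lowest" never appears in the toy corpus, yet it tokenizes into two meaningful known pieces rather than an unknown token.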
Vocabulary Lookup
Once the text is split into tokens, each token is mapped to an integer ID via a vocabulary table — a simple hash map.
Special tokens are reserved at fixed IDs, conventionally at the start of the vocabulary (the exact IDs and names vary between models; a common layout):
| Token | ID | Purpose |
|---|---|---|
| [PAD] | 0 | Padding to uniform length |
| [BOS] / <s> | 1 | Beginning of sequence |
| [EOS] / </s> | 2 | End of sequence |
| [UNK] | 3 | Unknown / out-of-vocabulary |
| [MASK] | 4 | Masked token (BERT-style) |
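In code, the lookup stage is nothing more than a dictionary with a fallback. A minimal sketch (the subword strings and IDs are illustrative, following the conventions in the table above):

```python
# Illustrative vocabulary: special tokens at fixed low IDs, subwords after them.
vocab = {"[PAD]": 0, "[BOS]": 1, "[EOS]": 2, "[UNK]": 3, "[MASK]": 4,
         "transform": 5, "##ers": 6, "are": 7, "power": 8, "##ful": 9}

def tokens_to_ids(tokens, vocab):
    """Map each token string to its integer ID, falling back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

ids = tokens_to_ids(["[BOS]", "transform", "##ers", "are", "power", "##ful", "[EOS]"], vocab)
print(ids)  # [1, 5, 6, 7, 8, 9, 2]
```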
Each token is looked up in the vocabulary table to produce its integer ID, which is then passed to the embedding layer.
From IDs to Embeddings
The token ID sequence is still just a list of integers — no semantic content. The embedding layer converts each ID $t_i$ into a dense vector by row lookup:

$$x_i = E[t_i], \qquad E \in \mathbb{R}^{|V| \times d}$$

- $|V|$: vocabulary size (e.g. 50,257 for GPT-2)
- $d$: embedding dimension (e.g. 768 for GPT-2 base)
- $E[t_i]$: the $t_i$-th row of the embedding matrix
The embedding matrix is a learned parameter — its rows are updated during training via backpropagation.
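A minimal sketch of the lookup-then-embed step in PyTorch, using the GPT-2 base figures quoted above and the illustrative token IDs from the vocabulary sketch earlier:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_257, 768  # GPT-2 base figures from above
embedding = nn.Embedding(vocab_size, d_model)  # learned matrix E, shape (|V|, d)

token_ids = torch.tensor([1, 5, 6, 7, 8, 9, 2])  # output of the vocabulary lookup
vectors = embedding(token_ids)  # each ID selects one row of E
print(vectors.shape)  # torch.Size([7, 768])
```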
Other Tokenization Algorithms
BPE is the most common, but two other algorithms are widely used:
WordPiece (BERT, DistilBERT)
WordPiece also iteratively merges pairs, but the merge criterion is different. Instead of raw frequency, it picks the pair whose merge most increases the likelihood of the training data under a unigram language model, which reduces to the score

$$\text{score}(a, b) = \frac{\text{count}(ab)}{\text{count}(a) \cdot \text{count}(b)}$$
Tokens not at the start of a word are prefixed with ## (e.g. ##ing).
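The effect of the normalisation is easiest to see numerically. A minimal sketch with invented counts: a pair of individually common parts scores low even when frequent, while parts that almost always co-occur score high.

```python
def wordpiece_score(count_ab, count_a, count_b):
    """Likelihood-based merge criterion: pair frequency normalised by
    the frequencies of its parts."""
    return count_ab / (count_a * count_b)

# Common parts, frequent pair: scores low, so it is merged late (or never).
print(wordpiece_score(count_ab=500, count_a=10_000, count_b=8_000))  # 6.25e-06
# Rare parts that nearly always appear together: scores high, merged early.
print(wordpiece_score(count_ab=400, count_a=450, count_b=420))       # ~2.1e-03
```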
Unigram Language Model (SentencePiece / T5 / ALBERT)
Unigram starts with a large candidate vocabulary and prunes it. Each token's probability is modelled independently, so a segmentation $x = (x_1, \dots, x_n)$ has probability

$$P(x) = \prod_{i=1}^{n} p(x_i)$$
Tokens whose removal least reduces the total corpus likelihood are pruned iteratively until the target vocabulary size is reached.
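The independence assumption makes segmentations directly comparable. A minimal sketch with invented token probabilities, comparing two candidate segmentations of the same string (in log space, as real implementations do):

```python
import math

# Illustrative unigram probabilities for candidate tokens.
probs = {"token": 0.01, "tok": 0.004, "en": 0.03}

def segmentation_log_prob(tokens, probs):
    """Log-probability of a segmentation under the independence assumption."""
    return sum(math.log(probs[t]) for t in tokens)

print(segmentation_log_prob(["token"], probs))      # -4.61  (preferred)
print(segmentation_log_prob(["tok", "en"], probs))  # -9.03
```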
| Algorithm | Direction | Criterion | Used in |
|---|---|---|---|
| BPE | Bottom-up (merge) | Frequency | GPT-2, RoBERTa, LLaMA, Falcon |
| WordPiece | Bottom-up (merge) | Likelihood ratio | BERT, DistilBERT |
| Unigram LM | Top-down (prune) | Marginal likelihood | T5, ALBERT, mBART |
Interactive Tokenizer
Try typing your own text below. The tokenizer applies a simplified BPE-style vocabulary to split your input into subword tokens and assign IDs.
A simplified subword tokenizer. Notice how compound words decompose into known subword pieces, and the token-to-word ratio changes with vocabulary coverage.
Real tokenizers are larger
Production tokenizers (GPT-2: 50,257 tokens; LLaMA-3: 128,256 tokens) have far richer vocabularies. The interactive demo above uses a small illustrative vocabulary — most words will decompose into character-level pieces.
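To inspect a production vocabulary directly, you can load one yourself. A quick sketch using the Hugging Face transformers library (the exact token split in the last comment may vary slightly across tokenizer versions):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.vocab_size)                  # 50257
ids = tok.encode("transformers are powerful")
print(ids)                             # integer IDs under GPT-2's BPE vocabulary
print(tok.convert_ids_to_tokens(ids))  # e.g. ['transform', 'ers', 'Ġare', 'Ġpowerful']
                                       # ('Ġ' marks a token that begins with a space)
```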
Summary
| Concept | What it does |
|---|---|
| Tokenizer | Splits text into discrete subword units |
| BPE | Learns merges from corpus frequency; builds vocabulary bottom-up |
| Vocabulary table | Maps token string → integer ID |
| Embedding layer | Maps integer ID → dense learned vector |
Tokenization is the first and most constrained stage of any language model pipeline. The vocabulary is fixed after training — anything outside it must be decomposed. Understanding this boundary is essential for debugging unexpected model behaviour on rare words, numbers, code, and multilingual text.