
Tokenization — How Language Models Read Text

A deep dive into subword tokenization, Byte-Pair Encoding, and vocabulary lookup — the first stage of every language model.

What is Tokenization?

Language models cannot read raw text. Before any computation happens, text must be converted into a sequence of integers — a format tensors can hold and GPUs can process.

Tokenization is that conversion. A tokenizer:

  1. Splits raw text into discrete units called tokens
  2. Maps each token to an integer ID from a fixed vocabulary
  3. Produces a tensor of IDs that the model embeds and processes

The token is not always a full word. Most modern tokenizers operate at the subword level — splitting words into meaningful fragments based on corpus frequency.

[Figure: the sentence "transformers are powerful" split into subword tokens, each assigned a vocabulary ID.]


Why Subword Tokenization?

Three approaches exist, each with a critical flaw if used alone:

| Strategy | Example | Problem |
|---|---|---|
| Character-level | t, r, a, i, n | Sequences become very long; hard to learn meaning |
| Word-level | train, training as separate entries | Vocabulary explodes; rare words become [UNK] |
| Subword-level | train, ##ing | Balances length and coverage ✓ |

Subword tokenization is a compression algorithm applied to a training corpus. Frequent sequences stay whole; infrequent ones are decomposed into smaller parts that the model has seen before.

Key Insight

A well-designed vocabulary handles unknown words gracefully — they decompose into known subwords rather than collapsing to a single [UNK] token that loses all information.
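As an illustration, here is how a WordPiece-style tokenizer might fall back on known pieces via greedy longest-prefix matching. This is a sketch; the vocabulary is invented, and real tokenizers add word-boundary markers and byte-level fallbacks.

```python
def decompose(word, vocab):
    """Split a word into known subwords by greedy longest-prefix match."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return ["[UNK]"]  # no known piece covers this position
    return pieces

vocab = {"train", "ing", "re", "un", "ed", "s"}
print(decompose("retraining", vocab))  # ['re', 'train', 'ing']
```

Even though "retraining" is not in the vocabulary, it decomposes into three known pieces instead of collapsing to [UNK].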


Byte-Pair Encoding (BPE)

BPE is the tokenization algorithm behind GPT-2, GPT-3, LLaMA, and most modern LLMs. It was originally a data compression technique adapted for NLP by Sennrich et al. (2016).

The Algorithm

Initialize — split every word in the corpus into individual characters. Add a special end-of-word marker if needed. Build a frequency table.

Count pairs — scan all adjacent symbol pairs across the corpus and count how often each pair co-occurs.

Merge the most frequent pair — replace every occurrence of the most common pair with a new merged symbol. Add this pair to the merge table.

Repeat until the vocabulary reaches the target size (e.g. 50,000 for GPT-2).

The result is a merge table — an ordered list of pair merges. At inference time, the same merges are applied in the same order to tokenize new text.
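The training loop above can be sketched in a few lines of Python. This is a toy implementation that counts pairs weighted by word frequency and ignores end-of-word markers and byte-level details:

```python
from collections import Counter

def bpe_train(word_freqs, num_merges):
    """Learn a BPE merge table from a {word: frequency} corpus."""
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = merged
    return merges

merges = bpe_train({"low": 2, "lower": 2, "newer": 2, "wider": 2}, 4)
print(merges[0])  # ('e', 'r')
```

On this corpus the pair e·r is the most frequent, so the first learned merge creates the symbol er.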

Worked example. Start from the corpus {low ×2, lower ×2, newer ×2, wider ×2}, with every word split into individual characters (step 0). Counting adjacent pairs across the distinct words gives e·r a count of 3 (it occurs in lower, newer, and wider), while l·o, o·w, and w·e each occur twice. The first merge therefore creates the symbol er. Repeating the count-and-merge cycle for four steps builds up the merge table.

[Figure: BPE merge steps on a small corpus. Highlighted tokens are the pair being merged in each step.]


Mathematical View

Let $V_0 = \Sigma$ be the initial character vocabulary, where $\Sigma$ is the character set of the corpus. After $n$ merge operations we have a vocabulary $V_n$. Each merge operation:

$$V_{k+1} = V_k \cup \{ab\} \quad \text{where } (a, b) = \underset{(a,b)\,\in\, V_k \times V_k}{\arg\max}\ \text{freq}(ab)$$

The final tokenization of a string ss is determined by applying the learned merge rules greedily in priority order.
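The greedy application of merge rules can be sketched as follows. The merge list here is invented for the example; a real tokenizer would load the table learned during training:

```python
def bpe_encode(word, merges):
    """Tokenize one word by applying learned merges in priority order.

    `merges` is the ordered list of pairs from training. This sketch
    ignores end-of-word markers and byte-level details.
    """
    symbols = list(word)
    for a, b in merges:                     # earlier merges have priority
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return symbols

merges = [("e", "r"), ("er", "s"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowers", merges))  # ['low', 'ers']
```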


Vocabulary Lookup

Once the text is split into tokens, each token is mapped to an integer ID via a vocabulary table — a simple hash map.

$$\text{encode}: \text{token} \to \mathbb{Z} \qquad \text{decode}: \mathbb{Z} \to \text{token}$$

Special tokens are reserved at fixed positions in every vocabulary:

| Token | ID | Purpose |
|---|---|---|
| [BOS] / <s> | 1 | Beginning of sequence |
| [EOS] / </s> | 2 | End of sequence |
| [PAD] | 0 | Padding to uniform length |
| [UNK] | 3 | Unknown / out-of-vocabulary |
| [MASK] | 4 | Masked token (BERT-style) |

| Token | ID |
|---|---|
| [BOS] | 1 |
| transform | 4300 |
| ers | 1158 |
| Ġare | 389 |
| Ġpower | 1917 |
| ful | 913 |
| [EOS] | 2 |

Each token is looked up in the vocabulary table to produce its integer ID, which is then passed to the embedding layer. (The Ġ prefix marks a token that begins with a space, a GPT-2 byte-level BPE convention.)
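The lookup in both directions is a plain hash-map access. A minimal sketch, using the same illustrative IDs as the table above (not the IDs of any real tokenizer):

```python
# A minimal vocabulary table as a hash map (IDs are illustrative).
vocab = {"[PAD]": 0, "[BOS]": 1, "[EOS]": 2, "[UNK]": 3,
         "transform": 4300, "ers": 1158, "Ġare": 389,
         "Ġpower": 1917, "ful": 913}
inverse = {i: t for t, i in vocab.items()}

def encode(tokens):
    # Unknown tokens fall back to the [UNK] id.
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

def decode(ids):
    return [inverse[i] for i in ids]

ids = encode(["[BOS]", "transform", "ers", "Ġare", "Ġpower", "ful", "[EOS]"])
print(ids)  # [1, 4300, 1158, 389, 1917, 913, 2]
```

Because encode and decode are inverse lookups on the same table, round-tripping a token sequence through both is lossless as long as every token is in the vocabulary.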


From IDs to Embeddings

The token ID sequence is still just a list of integers — no semantic content. The embedding layer converts each ID into a dense vector:

$$\mathbf{e}_i = E[t_i] \quad \text{where } E \in \mathbb{R}^{|V| \times d}$$

  • $|V|$: vocabulary size (e.g. 50,257 for GPT-2)
  • $d$: embedding dimension (e.g. 768 for GPT-2 base)
  • $E[t_i]$: the $t_i$-th row of the embedding matrix

The embedding matrix EE is a learned parameter — its rows are updated during training via backpropagation.
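The row lookup can be sketched with NumPy. The shapes and token IDs here are illustrative, not GPT-2's real dimensions, and the matrix is random rather than learned:

```python
import numpy as np

# Toy embedding lookup: each token ID selects one row of E.
vocab_size, d_model = 1000, 16
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))   # learned in practice

token_ids = np.array([1, 300, 301, 2])
embeddings = E[token_ids]                    # row lookup
print(embeddings.shape)  # (4, 16)
```

One sequence of 4 IDs becomes a (4, d) matrix of dense vectors, which is what the first transformer layer consumes.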


Other Tokenization Algorithms

BPE is the most common, but two other algorithms are widely used:

WordPiece (BERT, RoBERTa)

WordPiece also iteratively merges pairs, but the merge criterion is different. Instead of raw frequency, it selects the pair whose merge most increases the likelihood of the training data under the model, which reduces to the score:

$$\text{score}(a, b) = \frac{\text{freq}(ab)}{\text{freq}(a) \times \text{freq}(b)}$$

Tokens not at the start of a word are prefixed with ## (e.g. ##ing).
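The effect of this scoring rule can be shown numerically. All counts below are invented for illustration:

```python
# WordPiece-style merge scoring: pair frequency divided by the product
# of the parts' frequencies (counts are made up).
def wordpiece_score(freq_ab, freq_a, freq_b):
    return freq_ab / (freq_a * freq_b)

# A modest pair made of rare parts outranks a frequent pair made of
# very common parts:
rare = wordpiece_score(freq_ab=20, freq_a=25, freq_b=30)
common = wordpiece_score(freq_ab=500, freq_a=9000, freq_b=8000)
print(rare > common)  # True
```

Under raw BPE frequency the second pair would be merged first (500 > 20); WordPiece prefers the first because its parts almost always occur together.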

Unigram Language Model (SentencePiece / T5 / ALBERT)

Unigram starts with a large candidate vocabulary and prunes it. Each token's probability is modelled independently:

$$P(x) = \prod_{i=1}^{n} p(t_i), \qquad \sum_{t \in V} p(t) = 1$$

Tokens with the lowest marginal likelihood are removed iteratively until the target vocabulary size is reached.
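Under this model, choosing a tokenization means comparing segmentation likelihoods. A sketch with invented token probabilities:

```python
import math

# Unigram LM sketch: the probability of a segmentation is the product
# of independent token probabilities (values are illustrative).
p = {"low": 0.05, "er": 0.04, "l": 0.01, "o": 0.01, "w": 0.01}

def log_prob(tokens):
    # Sum of logs avoids underflow for long sequences.
    return sum(math.log(p[t]) for t in tokens)

coarse = log_prob(["low", "er"])
fine = log_prob(["l", "o", "w", "er"])
print(coarse > fine)  # True: the coarser segmentation is more likely
```

The tokenizer keeps the segmentation with the highest likelihood, and pruning removes the candidate tokens that contribute least to the corpus likelihood overall.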

| Algorithm | Direction | Criterion | Used in |
|---|---|---|---|
| BPE | Bottom-up (merge) | Frequency | GPT-2, LLaMA, Falcon |
| WordPiece | Bottom-up (merge) | Likelihood ratio | BERT, RoBERTa |
| Unigram LM | Top-down (prune) | Marginal likelihood | T5, ALBERT, mBART |

Example: A Simplified Tokenizer

Applying a simplified BPE-style vocabulary to the input "transformers are powerful" (23 characters) splits it into 8 subword tokens, from [BOS] through [EOS], with IDs:

[1, 300, 301, 303, 4, 500, 501, 2]

That is roughly 2.7 tokens per word. Notice how compound words decompose into known subword pieces, and how the token-to-word ratio changes with vocabulary coverage.

Real tokenizers are larger

Production tokenizers (GPT-2: 50,257 tokens; LLaMA-3: 128,256 tokens) have far richer vocabularies. The simplified example above uses a small illustrative vocabulary, so most words will decompose into character-level pieces.


Summary

| Concept | What it does |
|---|---|
| Tokenizer | Splits text into discrete subword units |
| BPE | Learns merges from corpus frequency; builds vocabulary bottom-up |
| Vocabulary table | Maps token string → integer ID |
| Embedding layer | Maps integer ID → dense learned vector |

Tokenization is the first and most constrained stage of any language model pipeline. The vocabulary is fixed after training — anything outside it must be decomposed. Understanding this boundary is essential for debugging unexpected model behaviour on rare words, numbers, code, and multilingual text.
