Deep Neural Networks

Tokens to Embeddings — Giving Numbers Meaning

How integer token IDs become dense learned vectors that carry semantic meaning — the embedding layer explained from lookup table to positional encoding.

From IDs to Vectors

The tokenizer converts text into a list of integers. Those integers have no inherent meaning — the number 4300 knows nothing about the word "transform". The embedding layer fixes this by mapping each integer to a dense vector of real numbers:

$$\mathbf{e}_i = E[t_i] \quad \text{where } E \in \mathbb{R}^{|V| \times d}$$

The operation is a row lookup: token ID $t_i$ selects row $t_i$ from the embedding matrix $E$. The result is a $d$-dimensional vector that the model can operate on.

Figure: Token ID → Row Lookup → Embedding Vector. The token "transform" (ID 4300) selects row 4300 of $E \in \mathbb{R}^{|V| \times d}$ (5 of 50,257 rows shown); the extracted row is the token's dense vector representation.
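In code, the lookup really is plain row indexing. A minimal numpy sketch with a toy matrix — the sizes and token IDs are illustrative, not from any real model:

```python
import numpy as np

# Toy embedding matrix: vocabulary of 6 tokens, embedding dimension 4.
# (Real models use e.g. |V| = 50,257 and d = 768; the values here are random.)
rng = np.random.default_rng(0)
V, d = 6, 4
E = rng.standard_normal((V, d))

token_ids = np.array([4, 1, 4])   # hypothetical token IDs for a short sequence

# The embedding "layer" is just fancy indexing: one row of E per token ID.
embeddings = E[token_ids]          # shape (3, 4)

# The same token ID always selects the same row.
assert np.array_equal(embeddings[0], embeddings[2])
```

Frameworks wrap this in a module (e.g. an embedding layer object) mainly so the matrix is registered as a trainable parameter; the forward pass is still this lookup.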


The Embedding Matrix

$E$ has shape $|V| \times d$, where:

| Symbol | Meaning | GPT-2 base | LLaMA-3 8B |
| --- | --- | --- | --- |
| $\lvert V \rvert$ | Vocabulary size | 50,257 | 128,256 |
| $d$ | Embedding dimension | 768 | 4,096 |
| $\lvert V \rvert \times d$ | Total params | ~38.6 M | ~525 M |

Every entry of $E$ is a learned parameter, updated by backpropagation during training. At initialisation the rows are random. After training, each row encodes everything the model learned about that token from billions of examples.

Why not one-hot vectors?

One-hot encoding gives each token a vector of size $|V|$ with a single 1. It is sparse, high-dimensional, and has no notion of similarity — "cat" and "kitten" are orthogonal. Dense embeddings learn a geometry where similar tokens have similar vectors.
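A small numpy illustration of the contrast. The dense vectors for "cat" and "kitten" below are hypothetical, hand-picked to be close; a trained model would learn something similar from context:

```python
import numpy as np

# One-hot: each token gets its own axis, so any two distinct tokens are
# orthogonal — "cat" and "kitten" have similarity exactly 0.
V = 8
cat_onehot, kitten_onehot = np.eye(V)[2], np.eye(V)[5]
print(cat_onehot @ kitten_onehot)        # 0.0 (unit vectors, so dot = cosine)

# Dense (hypothetical learned vectors): related tokens can end up close.
cat    = np.array([0.9, 0.1, -0.3])
kitten = np.array([0.8, 0.2, -0.2])
sim = cat @ kitten / (np.linalg.norm(cat) * np.linalg.norm(kitten))
print(round(sim, 2))                     # close to 1
```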


What Embeddings Learn

Well-trained embeddings encode semantic and syntactic relationships as geometric structure. The classic example:

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

This works because the model learns that the "royalty" direction and the "gender" direction are roughly orthogonal axes in the embedding space. Words used in similar contexts end up with similar vectors — a consequence of the distributional hypothesis.
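The geometry can be sketched with toy vectors built from explicitly orthogonal "gender" and "royalty" directions. This is an idealisation — trained embeddings only approximate this structure — but it shows why the arithmetic works:

```python
import numpy as np

# Illustrative construction, not trained embeddings: two orthogonal
# semantic directions plus a shared base vector.
gender  = np.array([1.0, 0.0, 0.0])   # man -> woman direction
royalty = np.array([0.0, 1.0, 0.0])   # commoner -> royal direction

man   = np.array([0.0, 0.0, 1.0])
woman = man + gender
king  = man + royalty
queen = woman + royalty

# king - man + woman cancels the base and recombines the two offsets.
result = king - man + woman
assert np.allclose(result, queen)
```

In real embedding spaces the analogy holds only approximately: the nearest neighbour of the result vector is usually, not always, the expected word.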

Figure: 2D Projection of Embedding Space. Words in the same semantic group (royalty, people, animals, actions, places) cluster together, and the king–queen / man–woman analogy appears as a parallel offset.

Projection caveat

Real embeddings live in hundreds or thousands of dimensions. The 2D view above uses a dimensionality-reduction projection (like PCA or t-SNE) and preserves only a fraction of the structure. Relationships invisible in 2D may be clear in the full space.


Position Is Missing

Embeddings represent what a token is, but not where it appears in the sequence. The transformer's self-attention is permutation-equivariant — shuffling the input produces shuffled output — so without position information, "cat sat on mat" and "mat on sat cat" would produce identical embeddings.

The fix is positional encoding: a vector $\text{PE}(pos)$ added to the token embedding at each position.

$$\mathbf{x}_i = \mathbf{e}_i + \text{PE}(i)$$

Sinusoidal Positional Encoding

The original transformer (Vaswani et al., 2017) uses a fixed sinusoidal scheme:

$$\text{PE}(pos, 2k) = \sin\!\left(\frac{pos}{10000^{2k/d}}\right) \qquad \text{PE}(pos, 2k+1) = \cos\!\left(\frac{pos}{10000^{2k/d}}\right)$$

Even-indexed dimensions use sine; odd-indexed use cosine. The frequency decreases as the dimension index $k$ increases — high-$k$ dimensions oscillate slowly and encode coarse position, while low-$k$ dimensions oscillate fast and encode fine position.

Why sinusoids?

  • Every position has a unique encoding vector.
  • The model can extrapolate to longer sequences than seen during training — the sinusoids extend infinitely.
  • The encoding for position $pos + \delta$ is a linear function of the encoding at $pos$, which makes it easy for the model to learn relative position relationships.
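The scheme is a few lines of numpy. The function name `sinusoidal_pe` is just for this sketch:

```python
import numpy as np

def sinusoidal_pe(num_positions: int, d: int) -> np.ndarray:
    """PE[pos, 2k] = sin(pos / 10000^(2k/d)); PE[pos, 2k+1] = cos(same)."""
    positions = np.arange(num_positions)[:, None]   # (L, 1)
    two_k = np.arange(0, d, 2)[None, :]             # (1, d/2): the 2k values
    angle = positions / (10000.0 ** (two_k / d))
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angle)                     # even dims: sine
    pe[:, 1::2] = np.cos(angle)                     # odd dims: cosine
    return pe

pe = sinusoidal_pe(50, 8)
# Position 0 is sin(0)=0 on even dims, cos(0)=1 on odd dims.
print(pe[0])   # [0. 1. 0. 1. 0. 1. 0. 1.]
```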

Figure: Positional Encoding heatmap. Each row is $\text{PE}(pos)$ for one position of "the cat sat on mat"; columns are dimensions $d_0$–$d_7$. Low-index dimensions vary quickly from row to row, while high-index dimensions stay near (0, 1) over these short distances. An addition view shows each PE vector being added to the corresponding token embedding.


Learned Positional Encodings

Modern models often replace fixed sinusoids with learned positional embeddings — a second embedding matrix $P \in \mathbb{R}^{L_{max} \times d}$, where $L_{max}$ is the maximum sequence length. GPT-2, GPT-3, and most BERT variants use this approach.
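As a sketch, the learned variant is just a second lookup table indexed by position rather than token ID (random values stand in for trained ones here):

```python
import numpy as np

# Learned absolute positions: a matrix P with one row per position, trained
# by backpropagation exactly like E. Sizes and init scale are illustrative.
rng = np.random.default_rng(1)
L_max, d = 1024, 16
P = rng.standard_normal((L_max, d)) * 0.02

seq_len = 5
positions = np.arange(seq_len)
pos_emb = P[positions]            # same row-lookup trick as token embeddings

# Hard cutoff: a position >= L_max has no row, so the model
# cannot extrapolate beyond the training length.
assert pos_emb.shape == (seq_len, d)
```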

| Method | Extrapolates? | Parameters | Used in |
| --- | --- | --- | --- |
| Sinusoidal (fixed) | Yes | 0 | Transformer (2017), vanilla ViT |
| Learned absolute | No (hard cutoff) | $L_{max} \times d$ | GPT-2, BERT, RoBERTa |
| Rotary (RoPE) | Yes (via relative pos) | 0 | LLaMA, Mistral, Gemma |
| ALiBi | Yes (linear decay) | 0 | MPT, BLOOM |

RoPE

Rotary Position Embedding (RoPE) encodes position by rotating the query and key vectors before attention — not by adding a vector to the embedding. This naturally encodes relative distances and generalises well to lengths beyond training.
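A minimal numpy sketch of the rotation idea (not a full attention implementation): each consecutive pair of dimensions is rotated by a position-dependent angle, so dot products between rotated queries and keys depend only on the relative offset. The frequency schedule mirrors the sinusoidal one; the helper name is ours:

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate consecutive (even, odd) dimension pairs of x by pos * theta_k."""
    d = x.shape[-1]
    two_k = np.arange(0, d, 2)
    theta = pos / (10000.0 ** (two_k / d))    # one angle per dimension pair
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
k = np.array([0.0, 1.0, 0.0, 1.0])

# Relative property: the score depends only on the offset between positions.
a = rope_rotate(q, 7) @ rope_rotate(k, 4)     # offset 3
b = rope_rotate(q, 10) @ rope_rotate(k, 7)    # offset 3
assert np.isclose(a, b)
```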


The Full Pipeline

Tokenize — split raw text into subword tokens using BPE or WordPiece.

ID lookup — map each token to an integer ID via the vocabulary table.

Embedding lookup — select row $E[t_i]$ for each token ID. This is the token embedding.

Add positional encoding — compute $\mathbf{x}_i = E[t_i] + \text{PE}(i)$. This is the input to the first transformer block.

Transformer blocks — $\mathbf{x}_i$ flows through $N$ layers of self-attention and feed-forward networks.
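The lookup and addition steps can be sketched end to end in numpy (token IDs and matrix contents are hypothetical stand-ins; tokenisation itself is covered elsewhere):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 8
E = rng.standard_normal((V, d))               # learned token embeddings

token_ids = np.array([12, 7, 55])             # "the cat sat" -> IDs (step 2)
e = E[token_ids]                              # embedding lookup (step 3)

# Sinusoidal PE for each position (step 4).
pos = np.arange(len(token_ids))[:, None]
two_k = np.arange(0, d, 2)[None, :]
angle = pos / (10000.0 ** (two_k / d))
pe = np.zeros_like(e)
pe[:, 0::2], pe[:, 1::2] = np.sin(angle), np.cos(angle)

x = e + pe                                    # input to the first transformer block
assert x.shape == (3, d)
```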

Figure: Full Embedding Pipeline. Raw text ("the cat sat") is split into subword tokens and carried through the stages above, from raw tokens to transformer input.


Embedding Visualisation Tips

When debugging or analysing a model's embeddings, a few techniques are useful:

Cosine similarity — the standard metric for embedding proximity. Not Euclidean distance, because the magnitude of an embedding vector carries less information than its direction:

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$$
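A quick check of why direction matters more than magnitude: scaling a vector moves it far away in Euclidean terms but leaves cosine similarity untouched. The helper name is ours:

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = 4.0 * u                       # same direction, 4x the magnitude

# Euclidean distance is large, but cosine similarity ignores magnitude.
print(np.linalg.norm(u - v))      # 3 * ||u||, clearly nonzero
print(cosine_sim(u, v))           # 1.0 (up to float rounding)
```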

t-SNE / UMAP — project $d$-dimensional embeddings to 2D for visualisation. Both preserve local structure (nearby points stay nearby) but distort global distances.

Probing classifiers — train a linear classifier on top of frozen embeddings to test whether a specific property (part of speech, sentiment, entity type) is linearly decodable. If it is, the embedding encodes that property.


Summary

| Concept | Detail |
| --- | --- |
| Embedding layer | Learned matrix $E$ of shape $\lvert V \rvert \times d$; each token ID selects a row |
| Embedding dimension | $d$ = 768 (GPT-2), 4,096 (LLaMA-3 8B); controls model capacity |
| What embeddings learn | Geometry encodes semantic/syntactic similarity from co-occurrence statistics |
| Positional encoding | Added vector $\text{PE}(i)$ so the model knows token order |
| Sinusoidal PE | Fixed, closed-form, extrapolates; used in the original transformer |
| Learned PE | Trained matrix; bounded to $L_{max}$; GPT-2, BERT |
| RoPE / ALiBi | Relative position schemes; better length generalisation |

The embedding layer is the only place in a standard transformer where token identity and sequence position are explicitly represented. Everything the model knows about "what this word is" and "where it sits in the sentence" enters through this single addition: $\mathbf{x}_i = E[t_i] + \text{PE}(i)$.
