Deep Neural Networks

Tokens to Embeddings — Giving Numbers Meaning

How integer token IDs become dense learned vectors that carry semantic meaning — the embedding layer explained from lookup table to positional encoding.

From IDs to Vectors

The tokenizer converts text into a list of integers. Those integers have no inherent meaning — the number 4300 knows nothing about the word "transform". The embedding layer fixes this by mapping each integer to a dense vector of real numbers:

$$\mathbf{e}_i = E[t_i] \quad \text{where } E \in \mathbb{R}^{|V| \times d}$$

The operation is a row lookup: token ID $t_i$ selects row $t_i$ from the embedding matrix $E$. The result is a $d$-dimensional vector that the model can operate on.

Figure: Token ID → Row Lookup → Embedding Vector. The token "transform" (ID 4300) selects row 4300 of $E \in \mathbb{R}^{|V| \times d}$ (5 of 50,257 rows shown); the extracted row is the token's dense vector representation.
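In code, the lookup really is plain row indexing. A minimal numpy sketch with a toy matrix — the sizes and token IDs are illustrative, not from any real model:

```python
import numpy as np

# Toy embedding matrix: vocabulary of 6 tokens, embedding dimension 4.
# (Real models use e.g. |V| = 50,257 and d = 768; the values here are random.)
rng = np.random.default_rng(0)
V, d = 6, 4
E = rng.standard_normal((V, d))

token_ids = np.array([4, 1, 4])   # hypothetical token IDs for a short sequence

# The embedding "layer" is just fancy indexing: one row of E per token ID.
embeddings = E[token_ids]          # shape (3, 4)

# The same token ID always selects the same row.
assert np.array_equal(embeddings[0], embeddings[2])
```

Frameworks wrap this in a module (e.g. an embedding layer object) mainly so the matrix is registered as a trainable parameter; the forward pass is still this lookup.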


The Embedding Matrix

$E$ has shape $|V| \times d$, where:

| Symbol | Meaning | GPT-2 base | LLaMA-3 8B |
| --- | --- | --- | --- |
| $\lvert V \rvert$ | Vocabulary size | 50,257 | 128,256 |
| $d$ | Embedding dimension | 768 | 4,096 |
| $\lvert V \rvert \times d$ | Total params | ~38.6 M | ~525 M |

Every entry of $E$ is a learned parameter, updated by backpropagation during training. At initialisation the rows are random. After training, each row encodes everything the model learned about that token from billions of examples.

Why not one-hot vectors?

One-hot encoding gives each token a vector of size $|V|$ with a single 1. It is sparse, high-dimensional, and has no notion of similarity — "cat" and "kitten" are orthogonal. Dense embeddings learn a geometry where similar tokens have similar vectors.
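A small numpy illustration of the contrast. The dense vectors for "cat" and "kitten" below are hypothetical, hand-picked to be close; a trained model would learn something similar from context:

```python
import numpy as np

# One-hot: each token gets its own axis, so any two distinct tokens are
# orthogonal — "cat" and "kitten" have similarity exactly 0.
V = 8
cat_onehot, kitten_onehot = np.eye(V)[2], np.eye(V)[5]
print(cat_onehot @ kitten_onehot)        # 0.0 (unit vectors, so dot = cosine)

# Dense (hypothetical learned vectors): related tokens can end up close.
cat    = np.array([0.9, 0.1, -0.3])
kitten = np.array([0.8, 0.2, -0.2])
sim = cat @ kitten / (np.linalg.norm(cat) * np.linalg.norm(kitten))
print(round(sim, 2))                     # close to 1
```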


What Embeddings Learn

Well-trained embeddings encode semantic and syntactic relationships as geometric structure. The classic example:

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

This works because the model learns that the "royalty" direction and the "gender" direction are roughly orthogonal axes in the embedding space. Words used in similar contexts end up with similar vectors — a consequence of the distributional hypothesis.
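The geometry can be sketched with toy vectors built from explicitly orthogonal "gender" and "royalty" directions. This is an idealisation — trained embeddings only approximate this structure — but it shows why the arithmetic works:

```python
import numpy as np

# Illustrative construction, not trained embeddings: two orthogonal
# semantic directions plus a shared base vector.
gender  = np.array([1.0, 0.0, 0.0])   # man -> woman direction
royalty = np.array([0.0, 1.0, 0.0])   # commoner -> royal direction

man   = np.array([0.0, 0.0, 1.0])
woman = man + gender
king  = man + royalty
queen = woman + royalty

# king - man + woman cancels the base and recombines the two offsets.
result = king - man + woman
assert np.allclose(result, queen)
```

In real embedding spaces the analogy holds only approximately: the nearest neighbour of the result vector is usually, not always, the expected word.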

Figure: 2D Projection of Embedding Space. Words in the same semantic group (royalty, people, animals, actions, places) cluster together, and the king–queen / man–woman analogy appears as a parallel offset.

Projection caveat

Real embeddings live in hundreds or thousands of dimensions. The 2D view above uses a dimensionality-reduction projection (like PCA or t-SNE) and preserves only a fraction of the structure. Relationships invisible in 2D may be clear in the full space.


Position Is Missing

Embeddings represent what a token is, but not where it appears in the sequence. The transformer's self-attention is permutation-equivariant — shuffling the input produces shuffled output — so without position information, "cat sat on mat" and "mat on sat cat" would produce identical embeddings.

The fix is positional encoding: a vector $\text{PE}(pos)$ added to the token embedding at each position.

$$\mathbf{x}_i = \mathbf{e}_i + \text{PE}(i)$$

Sinusoidal Positional Encoding

The original transformer (Vaswani et al., 2017) uses a fixed sinusoidal scheme:

$$\text{PE}(pos, 2k) = \sin\!\left(\frac{pos}{10000^{2k/d}}\right) \qquad \text{PE}(pos, 2k+1) = \cos\!\left(\frac{pos}{10000^{2k/d}}\right)$$

Even-indexed dimensions use sine; odd-indexed use cosine. The frequency decreases as the dimension index $k$ increases — high-$k$ dimensions oscillate slowly and encode coarse position, while low-$k$ dimensions oscillate fast and encode fine position.

Why sinusoids?

  • Every position has a unique encoding vector.
  • The model can extrapolate to longer sequences than seen during training — the sinusoids extend infinitely.
  • The encoding for position $pos + \delta$ is a linear function of the encoding at $pos$, which makes it easy for the model to learn relative position relationships.
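The scheme is a few lines of numpy. The function name `sinusoidal_pe` is just for this sketch:

```python
import numpy as np

def sinusoidal_pe(num_positions: int, d: int) -> np.ndarray:
    """PE[pos, 2k] = sin(pos / 10000^(2k/d)); PE[pos, 2k+1] = cos(same)."""
    positions = np.arange(num_positions)[:, None]   # (L, 1)
    two_k = np.arange(0, d, 2)[None, :]             # (1, d/2): the 2k values
    angle = positions / (10000.0 ** (two_k / d))
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angle)                     # even dims: sine
    pe[:, 1::2] = np.cos(angle)                     # odd dims: cosine
    return pe

pe = sinusoidal_pe(50, 8)
# Position 0 is sin(0)=0 on even dims, cos(0)=1 on odd dims.
print(pe[0])   # [0. 1. 0. 1. 0. 1. 0. 1.]
```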

Figure: Positional Encoding heatmap. Each row is $\text{PE}(pos)$ for one position of "the cat sat on mat"; columns are dimensions $d_0$–$d_7$. Low-index dimensions vary quickly from row to row, while high-index dimensions stay near (0, 1) over these short distances. An addition view shows each PE vector being added to the corresponding token embedding.


Learned Positional Encodings

Modern models often replace fixed sinusoids with learned positional embeddings — a second embedding matrix $P \in \mathbb{R}^{L_{max} \times d}$, where $L_{max}$ is the maximum sequence length. GPT-2, GPT-3, and most BERT variants use this approach.
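As a sketch, the learned variant is just a second lookup table indexed by position rather than token ID (random values stand in for trained ones here):

```python
import numpy as np

# Learned absolute positions: a matrix P with one row per position, trained
# by backpropagation exactly like E. Sizes and init scale are illustrative.
rng = np.random.default_rng(1)
L_max, d = 1024, 16
P = rng.standard_normal((L_max, d)) * 0.02

seq_len = 5
positions = np.arange(seq_len)
pos_emb = P[positions]            # same row-lookup trick as token embeddings

# Hard cutoff: a position >= L_max has no row, so the model
# cannot extrapolate beyond the training length.
assert pos_emb.shape == (seq_len, d)
```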

| Method | Extrapolates? | Parameters | Used in |
| --- | --- | --- | --- |
| Sinusoidal (fixed) | Yes | 0 | Transformer (2017), vanilla ViT |
| Learned absolute | No (hard cutoff) | $L_{max} \times d$ | GPT-2, BERT, RoBERTa |
| Rotary (RoPE) | Yes (via relative pos) | 0 | LLaMA, Mistral, Gemma |
| ALiBi | Yes (linear decay) | 0 | MPT, BLOOM |

RoPE

Rotary Position Embedding (RoPE) encodes position by rotating the query and key vectors before attention — not by adding a vector to the embedding. This naturally encodes relative distances and generalises well to lengths beyond training.
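A minimal numpy sketch of the rotation idea (not a full attention implementation): each consecutive pair of dimensions is rotated by a position-dependent angle, so dot products between rotated queries and keys depend only on the relative offset. The frequency schedule mirrors the sinusoidal one; the helper name is ours:

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate consecutive (even, odd) dimension pairs of x by pos * theta_k."""
    d = x.shape[-1]
    two_k = np.arange(0, d, 2)
    theta = pos / (10000.0 ** (two_k / d))    # one angle per dimension pair
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

q = np.array([1.0, 0.0, 1.0, 0.0])
k = np.array([0.0, 1.0, 0.0, 1.0])

# Relative property: the score depends only on the offset between positions.
a = rope_rotate(q, 7) @ rope_rotate(k, 4)     # offset 3
b = rope_rotate(q, 10) @ rope_rotate(k, 7)    # offset 3
assert np.isclose(a, b)
```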


The Full Pipeline

Tokenize — split raw text into subword tokens using BPE or WordPiece.

ID lookup — map each token to an integer ID via the vocabulary table.

Embedding lookup — select row $E[t_i]$ for each token ID. This is the token embedding.

Add positional encoding — compute $\mathbf{x}_i = E[t_i] + \text{PE}(i)$. This is the input to the first transformer block.

Transformer blocks — $\mathbf{x}_i$ flows through $N$ layers of self-attention and feed-forward networks.
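The lookup and addition steps can be sketched end to end in numpy (token IDs and matrix contents are hypothetical stand-ins; tokenisation itself is covered elsewhere):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 8
E = rng.standard_normal((V, d))               # learned token embeddings

token_ids = np.array([12, 7, 55])             # "the cat sat" -> IDs (step 2)
e = E[token_ids]                              # embedding lookup (step 3)

# Sinusoidal PE for each position (step 4).
pos = np.arange(len(token_ids))[:, None]
two_k = np.arange(0, d, 2)[None, :]
angle = pos / (10000.0 ** (two_k / d))
pe = np.zeros_like(e)
pe[:, 0::2], pe[:, 1::2] = np.sin(angle), np.cos(angle)

x = e + pe                                    # input to the first transformer block
assert x.shape == (3, d)
```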

Figure: Full Embedding Pipeline. Raw text ("the cat sat") is split into subword tokens and carried through the stages above, from raw tokens to transformer input.


Embedding Visualisation Tips

When debugging or analysing a model's embeddings, a few techniques are useful:

Cosine similarity — the standard metric for embedding proximity. Not Euclidean distance, because the magnitude of an embedding vector carries less information than its direction:

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$$
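A quick check of why direction matters more than magnitude: scaling a vector moves it far away in Euclidean terms but leaves cosine similarity untouched. The helper name is ours:

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = 4.0 * u                       # same direction, 4x the magnitude

# Euclidean distance is large, but cosine similarity ignores magnitude.
print(np.linalg.norm(u - v))      # 3 * ||u||, clearly nonzero
print(cosine_sim(u, v))           # 1.0 (up to float rounding)
```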

t-SNE / UMAP — project $d$-dimensional embeddings to 2D for visualisation. Both preserve local structure (nearby points stay nearby) but distort global distances.

Probing classifiers — train a linear classifier on top of frozen embeddings to test whether a specific property (part of speech, sentiment, entity type) is linearly decodable. If it is, the embedding encodes that property.


Summary

| Concept | Detail |
| --- | --- |
| Embedding layer | Learned matrix $E$ of shape $\lvert V \rvert \times d$; each token ID selects a row |
| Embedding dimension | $d$ = 768 (GPT-2), 4,096 (LLaMA-3 8B); controls model capacity |
| What embeddings learn | Geometry encodes semantic/syntactic similarity from co-occurrence statistics |
| Positional encoding | Added vector $\text{PE}(i)$ so the model knows token order |
| Sinusoidal PE | Fixed, closed-form, extrapolates; used in the original transformer |
| Learned PE | Trained matrix; bounded to $L_{max}$; GPT-2, BERT |
| RoPE / ALiBi | Relative position schemes; better length generalisation |

The embedding layer is the only place in a standard transformer where token identity and sequence position are explicitly represented. Everything the model knows about "what this word is" and "where it sits in the sentence" enters through this single addition: $\mathbf{x}_i = E[t_i] + \text{PE}(i)$.
