Tokens to Embeddings — Giving Numbers Meaning
How integer token IDs become dense learned vectors that carry semantic meaning — the embedding layer explained from lookup table to positional encoding.
From IDs to Vectors
The tokenizer converts text into a list of integers. Those integers have no inherent meaning: the number 4300 knows nothing about the word "transform". The embedding layer fixes this by mapping each integer to a dense vector of real numbers:

e_i = E[i] ∈ ℝ^d

The operation is a row lookup: token ID i selects row i of the embedding matrix E. The result is a d-dimensional vector that the model can operate on.
*Figure: token ID → row lookup → embedding vector. Each token ID selects a row from the embedding matrix E ∈ ℝ^{|V|×d}; the extracted row e_i is the token's dense vector representation.*
The Embedding Matrix
The embedding matrix E has shape |V| × d, where:
| Symbol | Meaning | GPT-2 base | LLaMA-3 8B |
|---|---|---|---|
| \|V\| | Vocabulary size | 50,257 | 128,256 |
| d | Embedding dimension | 768 | 4,096 |
| \|V\| × d | Total params | ~38.6 M | ~524 M |
Every entry of E is a learned parameter, updated by backpropagation during training. At initialisation the rows are random. After training, each row encodes everything the model learned about that token from billions of examples.
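As a concrete sketch, this is what the layer looks like in PyTorch, using the GPT-2 sizes from the table above; the token IDs here are illustrative rather than the output of a real tokenizer:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_257, 768              # |V| and d for GPT-2 base
embedding = nn.Embedding(vocab_size, d_model)  # E, randomly initialised, trained by backprop

token_ids = torch.tensor([4300, 318, 257])     # illustrative token IDs from the tokenizer
vectors = embedding(token_ids)                 # row lookup, shape (3, 768)

# The lookup is literally indexing into the weight matrix:
assert torch.equal(vectors, embedding.weight[token_ids])
```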
Why not one-hot vectors?
One-hot encoding gives each token a vector of size |V| with a single 1. It is sparse, high-dimensional, and has no notion of similarity: "cat" and "kitten" are orthogonal. Dense embeddings learn a geometry where similar tokens have similar vectors.
What Embeddings Learn
Well-trained embeddings encode semantic and syntactic relationships as geometric structure. The classic example:

vec(king) − vec(man) + vec(woman) ≈ vec(queen)
This works because the model learns that the "royalty" direction and the "gender" direction are roughly orthogonal axes in the embedding space. Words used in similar contexts end up with similar vectors — a consequence of the distributional hypothesis.
*Figure: 2D projection of the embedding space. Words in the same semantic group cluster together, and analogous pairs show the parallel gender–royalty structure.*
Projection caveat
Real embeddings live in hundreds or thousands of dimensions. The 2D view above uses a dimensionality-reduction projection (like PCA or t-SNE) and preserves only a fraction of the structure. Relationships invisible in 2D may be clear in the full space.
Position Is Missing
Embeddings represent what a token is, but not where it appears in the sequence. The transformer's self-attention is permutation-equivariant: shuffling the input merely shuffles the output in the same way, so without position information the model would treat "cat sat on mat" and "mat on sat cat" as the same bag of tokens.
The fix is positional encoding: a vector added to the token embedding at each position.
Sinusoidal Positional Encoding
The original transformer (Vaswani et al., 2017) uses a fixed sinusoidal scheme:

PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Even-indexed dimensions use sine; odd-indexed use cosine. The frequency decreases as the dimension index i increases: high-i dimensions oscillate slowly and encode coarse position, while low-i dimensions oscillate fast and encode fine position.
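A minimal NumPy sketch of this scheme (the function name and sizes are illustrative):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Build the (max_len, d_model) table of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]              # 2i for i = 0 .. d/2 - 1
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(max_len=128, d_model=768)   # each row is PE(pos)
```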
Why sinusoids?
- Every position has a unique encoding vector.
- The model can extrapolate to longer sequences than seen during training — the sinusoids extend infinitely.
- The encoding at position pos + k is a linear function of the encoding at pos, which makes it easy for the model to learn relative position relationships.
*Figure: positional encoding. Left: heatmap of PE values across positions and dimensions; each row is PE(pos). Right: addition view showing how a token's PE vector is added to its token embedding.*
Learned Positional Encodings
Modern models often replace fixed sinusoids with learned positional embeddings: a second embedding matrix P ∈ ℝ^{L_max × d}, where L_max is the maximum sequence length. GPT-2, GPT-3, and most BERT variants use this approach.
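In code, the only difference from the token embedding is what gets looked up: the position index instead of the token ID. A minimal sketch with assumed sizes (L_max = 1024, as in GPT-2):

```python
import torch
import torch.nn as nn

max_len, d_model = 1024, 768
pos_embedding = nn.Embedding(max_len, d_model)   # P, one learned row per position

seq_len = 6
positions = torch.arange(seq_len)                # [0, 1, ..., 5]
pos_vectors = pos_embedding(positions)           # (6, 768), added to the token embeddings
```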
| Method | Extrapolates? | Parameters | Used in |
|---|---|---|---|
| Sinusoidal (fixed) | Yes | 0 | Transformer (2017), vanilla ViT |
| Learned absolute | No (hard cutoff at L_max) | L_max × d | GPT-2, BERT, RoBERTa |
| Rotary (RoPE) | Yes (via relative pos) | 0 | LLaMA, Mistral, Gemma |
| ALiBi | Yes (linear decay) | 0 | MPT, BLOOM |
RoPE
Rotary Position Embedding (RoPE) encodes position by rotating the query and key vectors before attention — not by adding a vector to the embedding. This naturally encodes relative distances and generalises well to lengths beyond training.
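A simplified sketch of the idea, assuming a per-sequence tensor of shape (seq_len, d); real implementations such as LLaMA's operate on per-head query and key tensors and cache the angles:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Rotate adjacent dimension pairs of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    theta = torch.pow(10000.0, -torch.arange(0, d, 2).float() / d)   # one frequency per pair
    angles = positions[:, None].float() * theta[None, :]             # (seq_len, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin    # 2-D rotation applied to each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(10, 64)                          # toy query vectors for 10 positions
q_rot = rope_rotate(q, torch.arange(10))         # applied to q and k before attention scores
```

Because the rotation angle is proportional to position, the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m − n, which is what gives RoPE its relative-position behaviour.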
The Full Pipeline
1. Tokenize — split raw text into subword tokens using BPE or WordPiece.
2. ID lookup — map each token to an integer ID via the vocabulary table.
3. Embedding lookup — select row e_i = E[i] for each token ID. This is the token embedding.
4. Add positional encoding — compute x_i = e_i + p_i. This is the input to the first transformer block.
5. Transformer blocks — the resulting vectors flow through N layers of self-attention and feed-forward networks.
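Put together, everything up to the first transformer block is only a few lines. A toy sketch using learned positional embeddings and made-up token IDs (a real run would take the IDs from the tokenizer in step 2):

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 50_257, 768, 1024

tok_embedding = nn.Embedding(vocab_size, d_model)   # E: token embedding matrix
pos_embedding = nn.Embedding(max_len, d_model)      # P: learned positional matrix

token_ids = torch.tensor([[464, 3797, 3332, 319, 262, 2603]])  # illustrative IDs, batch of 1
positions = torch.arange(token_ids.shape[1])

x = tok_embedding(token_ids) + pos_embedding(positions)   # x_i = e_i + p_i, shape (1, 6, 768)
# x is the input to the first transformer block
```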
*Figure: the full embedding pipeline, from raw text split into subword tokens to the input of the first transformer block, showing what the data looks like at each stage.*
Embedding Visualisation Tips
When debugging or analysing a model's embeddings, a few techniques are useful:
- Cosine similarity — the standard metric for embedding proximity. Not Euclidean distance, because the magnitude of an embedding vector carries less information than its direction: cos(a, b) = a·b / (‖a‖ ‖b‖). A short sketch follows this list.
- t-SNE / UMAP — project d-dimensional embeddings to 2D for visualisation. Both preserve local structure (nearby points stay nearby) but distort global distances.
- Probing classifiers — train a linear classifier on top of frozen embeddings to test whether a specific property (part of speech, sentiment, entity type) is linearly decodable. If it is, the embedding encodes that property.
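As a quick sketch of the first technique, here is cosine similarity between two rows of an embedding table. The table and token IDs below are placeholders; with an untrained embedding the value is meaningless, so in practice you would load a trained model's weights:

```python
import torch
import torch.nn.functional as F

embedding = torch.nn.Embedding(50_257, 768)   # stand-in; in practice, a trained model's table

id_a, id_b = 4300, 4301                       # illustrative token IDs
e_a = embedding.weight[id_a]
e_b = embedding.weight[id_b]

cos = F.cosine_similarity(e_a, e_b, dim=0)    # a·b / (‖a‖ ‖b‖), in [-1, 1]
print(cos.item())
```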
Summary
| Concept | Detail |
|---|---|
| Embedding layer | Learned matrix of shape \|V\| × d; each token ID selects a row |
| Embedding dimension | d = 768 (GPT-2), 4,096 (LLaMA-3); controls model capacity |
| What embeddings learn | Geometry encodes semantic/syntactic similarity from co-occurrence statistics |
| Positional encoding | Added vector so the model knows token order |
| Sinusoidal PE | Fixed, closed-form, extrapolates; used in original transformer |
| Learned PE | Trained L_max × d matrix; bounded to L_max positions; GPT-2, BERT |
| RoPE / ALiBi | Relative position schemes; better length generalisation |
The embedding layer is the only place in a standard transformer where token identity and sequence position are explicitly represented. Everything the model knows about "what this word is" and "where it sits in the sentence" enters through this single addition: x_i = e_i + p_i.