RNN, LSTM & GRU: Sequence Modeling
How recurrent networks learn from sequences — the hidden state, vanishing gradients, LSTM's memory cell, and GRU's streamlined gating — with interactive visualizations of each architecture.
Language, audio, time series, and video all share a property that image classification does not: order matters. The word "not" completely reverses the meaning of "good" in a sentence being classified for sentiment. A feedforward network has no way to express this — it sees each token in isolation, with no memory of what came before.
Recurrent Neural Networks (RNNs) solve this by threading a hidden state through the sequence, accumulating context as they process each element. LSTM and GRU extend this idea with gating mechanisms that overcome the fundamental weakness of vanilla RNNs: the vanishing gradient problem.
1. The Recurrent Neural Network
Architecture
An RNN processes a sequence one element at a time. At each step $t$, it takes the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$:

$$h_t = \tanh(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h)$$

The output at each step can be $y_t = W_{hy}\, h_t + b_y$, or just the final hidden state $h_T$ may be used as a summary of the whole sequence.

A critical property: the weight matrices $W_{xh}$, $W_{hh}$, $W_{hy}$ are shared across all time steps. The same transformation is applied at every step. This weight sharing lets the network generalize to sequences of any length and keeps parameter count constant regardless of sequence length.
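As a concrete sketch, here is a minimal NumPy illustration of a single RNN step and of the same weights being reused across a short sequence (the function name `rnn_step` and the toy sizes are our own assumptions, not code from the article):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy sizes: input dimension d = 3, hidden dimension h = 4
rng = np.random.default_rng(0)
d, h = 3, 4
W_xh = rng.normal(scale=0.1, size=(h, d))
W_hh = rng.normal(scale=0.1, size=(h, h))
b_h = np.zeros(h)

h_t = np.zeros(h)                       # h_{-1} = 0: no prior context
for x_t in rng.normal(size=(5, d)):     # a sequence of five input vectors
    h_t = rnn_step(x_t, h_t, W_xh, W_hh, b_h)   # the same weights are reused at every step
print(h_t)                              # fixed-size summary of the whole sequence
```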
Unrolling Through Time
"Unrolling" is the conceptual trick of drawing one copy of the RNN cell for each time step, connected by the hidden state:
[Interactive figure: RNN Unrolled Through Time. Each cell shares the same weights. At t = 0 the cell processes "The", starting from h₋₁ = 0 (zero vector, no prior context); its output h₀ encodes everything up to "The".]
Each cell is identical — same weights, same computation — but receives a different context via $h_{t-1}$.
Many-to-Many, One-to-Many, Many-to-One
The same RNN architecture supports multiple task shapes:
| Mode | Input | Output | Example |
|---|---|---|---|
| Many-to-Many | Full sequence | Full sequence | Machine translation, POS tagging |
| Many-to-One | Full sequence | Single vector | Sentiment classification |
| One-to-Many | Single vector | Full sequence | Image captioning |
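For example, the many-to-one row can be sketched in a few lines of NumPy (our own toy illustration, with assumed sizes and parameter names): consume the whole sequence, then classify from the final hidden state.

```python
import numpy as np

def many_to_one(xs, W_xh, W_hh, b_h, W_hy, b_y):
    """Consume a full sequence, return class logits from the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x in xs:                               # full sequence in
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return W_hy @ h + b_y                      # single vector out (e.g. sentiment logits)

rng = np.random.default_rng(1)
d, h, k = 3, 4, 2                              # input size, hidden size, number of classes
params = dict(W_xh=0.1 * rng.normal(size=(h, d)),
              W_hh=0.1 * rng.normal(size=(h, h)),
              b_h=np.zeros(h),
              W_hy=0.1 * rng.normal(size=(k, h)),
              b_y=np.zeros(k))
print(many_to_one(rng.normal(size=(7, d)), **params))   # logits for a 7-token sequence
```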
2. Backpropagation Through Time
Training an RNN uses Backpropagation Through Time (BPTT): the unrolled network is treated as a very deep feedforward network, and gradients are propagated backward through each time step all the way to $t = 0$.

The loss gradient at step $t$ with respect to an earlier hidden state $h_k$ requires multiplying through all the intervening Jacobians:

$$\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}, \qquad \frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\!\left(1 - h_i^2\right) W_{hh}$$

Each factor involves the derivative of $\tanh$, which is bounded above by 1 and is typically much smaller.
The Vanishing Gradient Problem
When these Jacobians, whose norms typically sit well below 1, are repeatedly multiplied, the product shrinks exponentially in the number of steps. After as few as 10 steps, the gradient reaching the earliest tokens is essentially zero — early inputs have no influence on learning.
[Interactive chart: Gradient Flow Through Time, showing the relative gradient magnitude reaching each past time step (earlier steps on the left, the current step on the right). Caption: for a vanilla RNN, each BPTT step multiplies the gradient by the Jacobian of tanh (≤ 0.42); after 12 steps the gradient is essentially zero, so the network can't learn what happened early in the sequence.]
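To make the decay concrete, here is a small NumPy experiment of our own (hidden size and weight scale are assumptions): run a short forward pass, then apply the BPTT Jacobians backward and print the gradient norm at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, T = 16, 3, 20
W_hh = rng.normal(scale=0.5 / np.sqrt(h), size=(h, h))   # assumed weight scale
W_xh = rng.normal(scale=0.5, size=(h, d))

# Forward pass: record the hidden states
states = [np.zeros(h)]
for x in rng.normal(size=(T, d)):
    states.append(np.tanh(W_xh @ x + W_hh @ states[-1]))

# Backward pass: one Jacobian diag(1 - h_i^2) W_hh per step
grad = np.ones(h)
for h_i in reversed(states[1:]):
    grad = W_hh.T @ (grad * (1 - h_i ** 2))   # one BPTT step
    print(f"{np.linalg.norm(grad):.2e}")      # shrinks roughly geometrically
```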
Exploding Gradients
Gradients can also explode if $\|W_{hh}\|$ is large. This is handled in practice by gradient clipping: if $\|g\| > \theta$, rescale to $g \leftarrow \theta\, g / \|g\|$. Vanishing gradients are harder to fix because there is no signal to detect them — the gradient just silently disappears.
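A minimal sketch of norm clipping (our own illustration; frameworks provide equivalents, e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])        # ||g|| = 5
print(clip_by_norm(g, 1.0))     # rescaled to norm 1: [0.6, 0.8]
```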
3. Long Short-Term Memory (LSTM)
Hochreiter & Schmidhuber (1997) proposed the solution: separate the "memory" from the "working state" and control what is written and read via learned gates.
The Cell State — A Gradient Highway
The key innovation is the cell state $c_t$, a vector that runs horizontally through time with only minor, linear interactions:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Because the update is an addition (not a nonlinear squash), gradients flow back through $c_t$ without attenuation as long as the forget gate $f_t \approx 1$. This is the "constant error carousel" that makes LSTM work.
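Spelling this out (a short derivation of our own, treating the gates as constants with respect to $c$):

$$\frac{\partial c_t}{\partial c_{t-1}} = \mathrm{diag}(f_t) \qquad\Longrightarrow\qquad \frac{\partial c_t}{\partial c_{t-k}} \approx \prod_{i=t-k+1}^{t} \mathrm{diag}(f_i),$$

which stays close to the identity as long as each $f_i \approx 1$, instead of shrinking like the repeated $\tanh$ Jacobians of the vanilla RNN.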
The Four Computations
All four computations take the same concatenated input $[h_{t-1}, x_t]$:

Forget gate — decides what fraction of the old cell state to keep:

$$f_t = \sigma(W_f\, [h_{t-1}, x_t] + b_f)$$

Input gate — decides how much of the new candidate to write:

$$i_t = \sigma(W_i\, [h_{t-1}, x_t] + b_i)$$

Candidate cell state — the proposed new values:

$$\tilde{c}_t = \tanh(W_c\, [h_{t-1}, x_t] + b_c)$$

Output gate — filters what part of the cell state becomes the hidden state:

$$o_t = \sigma(W_o\, [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(c_t)$$
Why σ for gates, tanh for values?
Sigmoid outputs are in $(0, 1)$ — perfect for a multiplicative gate (0 = closed, 1 = fully open). Tanh outputs are in $(-1, 1)$, encoding signed values. This is why candidate states use tanh: they represent changes that can increase or decrease the cell state.
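Putting the four computations together, here is a minimal NumPy sketch of one LSTM step (our own illustration with assumed weight names; real implementations typically fuse the four matrix multiplies into one):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W[k] has shape (hidden, hidden + input) for k in f, i, c, o."""
    z = np.concatenate([h_prev, x_t])         # shared concatenated input [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: fraction of old cell to keep
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: how much candidate to write
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate values in (-1, 1)
    o = sigmoid(W["o"] @ z + b["o"])          # output gate: what to expose as h_t
    c = f * c_prev + i * c_tilde              # additive cell-state update
    h = o * np.tanh(c)                        # new hidden state
    return h, c

# Toy usage with hidden size 4 and input size 3
rng = np.random.default_rng(0)
hs, d = 4, 3
W = {k: rng.normal(scale=0.1, size=(hs, hs + d)) for k in "fico"}
b = {k: np.zeros(hs) for k in "fico"}
h, c = lstm_step(rng.normal(size=d), np.zeros(hs), np.zeros(hs), W, b)
```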
Interactive LSTM Cell
Adjust $x_t$, $h_{t-1}$, and $c_{t-1}$ to see how each gate responds. Step through all six computations:

[Interactive widget: LSTM Gate Calculator. Adjust the inputs and trace the data flow through all 6 computation steps; with the default inputs, the forget gate evaluates to σ = 0.546, retaining about 55% of c_{t-1} = 0.80.]
4. Gated Recurrent Unit (GRU)
Cho et al. (2014) proposed a simplified architecture that retains the gating mechanism but eliminates the separate cell state, reducing the number of parameters by ~25%.
Two Gates Instead of Four
Reset gate — controls how much past state influences the candidate:

$$r_t = \sigma(W_r\, [h_{t-1}, x_t] + b_r)$$

Update gate — interpolates between old state and new candidate (replaces both forget and input gates):

$$z_t = \sigma(W_z\, [h_{t-1}, x_t] + b_z)$$

Candidate hidden state — the reset gate moderates the role of $h_{t-1}$:

$$\tilde{h}_t = \tanh(W_h\, [r_t \odot h_{t-1}, x_t] + b_h)$$

New hidden state — a convex combination controlled by $z_t$:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

When $z_t \approx 0$ the cell ignores the input and copies the old state — equivalent to a maximally active forget gate. When $z_t \approx 1$ it discards all prior context and adopts the candidate.
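The same equations as a minimal NumPy sketch (our own illustration, following the $(1 - z_t)\,h_{t-1} + z_t\,\tilde{h}_t$ convention used above; some frameworks swap the roles of $z_t$ and $1 - z_t$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, b):
    """One GRU step: reset gate, update gate, candidate, convex combination."""
    z_in = np.concatenate([h_prev, x_t])
    r = sigmoid(W["r"] @ z_in + b["r"])        # reset gate
    z = sigmoid(W["z"] @ z_in + b["z"])        # update gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]) + b["h"])  # candidate
    return (1 - z) * h_prev + z * h_tilde      # interpolate old state and candidate

rng = np.random.default_rng(0)
hs, d = 4, 3
W = {k: rng.normal(scale=0.1, size=(hs, hs + d)) for k in "rzh"}
b = {k: np.zeros(hs) for k in "rzh"}
print(gru_step(rng.normal(size=d), np.zeros(hs), W, b))
```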
Interactive GRU Cell
[Interactive widget: GRU Cell Calculator. Two gates and no separate cell state, a streamlined design with similar capacity to LSTM. With the default inputs it traces one step:]

- Reset gate $r_t = 0.571$: the candidate uses 57% of the previous hidden state.
- Update gate $z_t = 0.643$: the mix is 64% new candidate, 36% old state.
- Candidate $\tilde{h}_t = 0.577$: the proposed new hidden state, modulated by the reset gate.
- New hidden state: $h_t \approx 0.36 \times 0.60 + 0.64 \times 0.58 \approx 0.585$.
Key insight: The update gate z replaces both forget and input gates from LSTM. When z ≈ 0, the cell copies the old state unchanged. When z ≈ 1, it replaces it with the new candidate entirely. The reset gate r controls how much past context shapes the candidate.
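A three-line check of that behavior, using the calculator's numbers from above (our own illustration):

```python
h_prev, h_tilde = 0.60, 0.58            # old state and candidate from the example above
for z in (0.01, 0.5, 0.99):             # update gate nearly closed, balanced, nearly open
    print(f"z = {z:0.2f} -> h_t = {(1 - z) * h_prev + z * h_tilde:0.3f}")
# z ~ 0 copies the old state (~0.600); z ~ 1 adopts the candidate (~0.580)
```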
GRU vs LSTM — When Does It Matter?
On most benchmarks, GRU and LSTM perform comparably. LSTM tends to win on tasks requiring very long-range memory or highly structured dependencies (e.g., parsing). GRU wins when training time matters or data is limited, because fewer parameters mean less overfitting. In practice, try GRU first and switch to LSTM if performance plateaus.
5. Comparison
[Interactive widget: RNN Family Comparison. The entries for the vanilla RNN:]

| | Vanilla RNN |
|---|---|
| Core update rule | $h_t = \tanh(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h)$ |
| Gates | None |
| Params | 1× (baseline) |
| Memory span | ~5–10 steps |
| Speed | ⚡⚡⚡ Fastest |
| Strengths | ✓ Simplest architecture · ✓ Fewest parameters · ✓ Fast to train & deploy |
| Weaknesses | ✗ Vanishing gradient · ✗ Can't learn long-range deps · ✗ Rarely used in production |
Parameter Count
For a hidden size $h$ and input size $d$, the parameter count per gate (one weight matrix acting on the concatenated $[h_{t-1}, x_t]$, plus a bias) is $h(h + d) + h$; each candidate computation costs the same.

| Model | Weight blocks (gates + candidate) | Param count (approx.) |
|---|---|---|
| RNN | 1 | $h(h+d)+h$ |
| GRU | 3 | $3\,[h(h+d)+h]$ |
| LSTM | 4 | $4\,[h(h+d)+h]$ |
This is why GRU (3× RNN) trains notably faster than LSTM (4× RNN) at the same hidden dimension.
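A quick sanity check of these counts (our own illustration with assumed sizes $h = 256$, $d = 128$; framework counts can differ slightly, e.g. when separate input and recurrent bias vectors are used):

```python
def param_count(hidden, inp, blocks):
    """blocks = number of weight blocks: 1 for RNN, 3 for GRU, 4 for LSTM."""
    per_block = hidden * (hidden + inp) + hidden   # W acting on [h, x], plus a bias
    return blocks * per_block

h, d = 256, 128
for name, blocks in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(f"{name}: {param_count(h, d, blocks):,} parameters")
# RNN: 98,560   GRU: 295,680   LSTM: 394,240
```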
6. Beyond RNNs
Transformers (Vaswani et al., 2017) have displaced RNNs for most sequence tasks. Instead of processing tokens one-by-one, Transformers attend to all positions in parallel — eliminating the sequential bottleneck and scaling to much larger models.
However, RNNs remain relevant in specific settings:
| Scenario | Why RNN wins |
|---|---|
| Streaming / online inference | Processes one token at a time with $O(1)$ memory per step (see the sketch after this table) |
| Very long sequences | Transformers are $O(n^2)$ in attention; RNNs are $O(n)$ |
| On-device / edge deployment | Small constant state, no KV-cache needed |
| Text-to-speech | Neural vocoders such as WaveRNN still use recurrence (WaveNet uses dilated convolutions) |
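To illustrate the streaming row, here is a sketch of stateful inference (our own illustration): the only thing carried between calls is one fixed-size hidden vector, so memory per incoming token stays constant no matter how long the stream runs.

```python
import numpy as np

class StreamingRNN:
    """Minimal stateful RNN: push tokens one at a time, keep only a fixed-size state."""
    def __init__(self, hidden, inp, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(scale=0.1, size=(hidden, inp))
        self.W_hh = rng.normal(scale=0.1, size=(hidden, hidden))
        self.b_h = np.zeros(hidden)
        self.h = np.zeros(hidden)              # the entire "memory" of the stream

    def push(self, x_t):
        self.h = np.tanh(self.W_xh @ x_t + self.W_hh @ self.h + self.b_h)
        return self.h

model = StreamingRNN(hidden=4, inp=3)
for x_t in np.random.default_rng(1).normal(size=(1000, 3)):   # arbitrarily long stream
    features = model.push(x_t)                 # constant memory, no growing attention cache
```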
State Space Models
Structured State Space Models (S4, Mamba) are a recent class of sequence models that combine the inference cost of RNNs with the parallelizable training of convolutions. They can be viewed as a learned, structured generalization of the LSTM cell state highway.
Summary
| Concept | Key idea |
|---|---|
| RNN hidden state | A fixed-size summary of all tokens seen so far, updated at each step |
| Weight sharing | Same $W_{xh}$, $W_{hh}$ used at every time step — generalizes to any length |
| Vanishing gradient | Repeated Jacobian multiplication shrinks gradients exponentially in vanilla RNNs |
| LSTM cell state | Additive update path that carries gradients across long time spans without attenuation |
| LSTM gates | Forget (what to erase), Input + Candidate (what to write), Output (what to expose) |
| GRU update gate | Single gate that interpolates between "copy old state" and "replace with candidate" |
| GRU reset gate | Controls how much past context enters the candidate computation |