
RNN, LSTM & GRU: Sequence Modeling

How recurrent networks learn from sequences — the hidden state, vanishing gradients, LSTM's memory cell, and GRU's streamlined gating — with interactive visualizations of each architecture.


Language, audio, time series, and video all share a property that image classification does not: order matters. The word "not" completely reverses the meaning of "good" in a sentence being classified for sentiment. A feedforward network has no way to express this — it sees each token in isolation, with no memory of what came before.

Recurrent Neural Networks (RNNs) solve this by threading a hidden state through the sequence, accumulating context as they process each element. LSTM and GRU extend this idea with gating mechanisms that overcome the fundamental weakness of vanilla RNNs: the vanishing gradient problem.


1. The Recurrent Neural Network

Architecture

An RNN processes a sequence $x_1, x_2, \ldots, x_T$ one element at a time. At each step $t$, it takes the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$:

$$h_t = \tanh\!\left(W_h \, h_{t-1} + W_x \, x_t + b\right)$$

The output at each step can be $y_t = W_y h_t + b_y$, or only the final hidden state $h_T$ may be used as a summary of the whole sequence.

A critical property: the weight matrices $W_h$, $W_x$, $W_y$ are shared across all time steps. The same transformation is applied at $t = 1$ and at $t = T$. This weight sharing lets the network generalize to sequences of any length and keeps the parameter count constant regardless of sequence length.
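In code, the entire forward pass is one loop that reuses a single set of weights at every step. A minimal NumPy sketch (the function name and shapes are illustrative, not taken from any particular library):

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b, h0=None):
    """Run a vanilla RNN over a sequence, reusing the same weights at every step.

    x_seq: (T, n) -- T time steps, input size n
    W_h: (d, d), W_x: (d, n), b: (d,) -- shared across all time steps
    Returns the list of hidden states h_1 ... h_T, each of shape (d,).
    """
    d = W_h.shape[0]
    h = np.zeros(d) if h0 is None else h0       # h_0 = 0: no prior context
    states = []
    for x_t in x_seq:                            # one step per sequence element
        h = np.tanh(W_h @ h + W_x @ x_t + b)     # h_t = tanh(W_h h_{t-1} + W_x x_t + b)
        states.append(h)
    return states

# Toy usage: 5 tokens with 8-dimensional embeddings, hidden size 16.
rng = np.random.default_rng(0)
n, d, T = 8, 16, 5
states = rnn_forward(rng.normal(size=(T, n)),
                     rng.normal(scale=0.1, size=(d, d)),
                     rng.normal(scale=0.1, size=(d, n)),
                     np.zeros(d))
print(len(states), states[-1].shape)   # 5 hidden states; h_T summarizes the sequence
```

The loop works for any sequence length because the same `W_h`, `W_x`, and `b` are applied at every iteration.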

Unrolling Through Time

"Unrolling" is the conceptual trick of drawing one copy of the RNN cell for each time step, connected by the hidden state:

[Figure: RNN unrolled through time over the input sequence "The cat sat on mat". Each cell applies the same shared weights; the hidden state $h_0, h_1, \ldots, h_4$ is passed from cell to cell, starting from $h_{-1} = 0$ (a zero vector, no prior context) and ending in the output $\hat{y}$.]

Each cell is identical — same weights, same computation — but receives a different context via $h_{t-1}$.

Many-to-Many, One-to-Many, Many-to-One

The same RNN architecture supports multiple task shapes:

| Mode | Input | Output | Example |
| --- | --- | --- | --- |
| Many-to-Many | Full sequence | Full sequence | Machine translation, POS tagging |
| Many-to-One | Full sequence | Single vector | Sentiment classification |
| One-to-Many | Single vector | Full sequence | Image captioning |

2. Backpropagation Through Time

Training an RNN uses Backpropagation Through Time (BPTT): the unrolled network is treated as a very deep feedforward network, and gradients are propagated backward through each time step all the way to $t = 1$.

The gradient of the loss $\mathcal{L}$ with respect to $W_h$ sums a contribution from every step $t$, and each contribution requires multiplying through a chain of step-to-step Jacobians:

$$\frac{\partial \mathcal{L}}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial h_t} \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}$$

Each factor $\frac{\partial h_{k+1}}{\partial h_k} = \mathrm{diag}\!\left(1 - h_{k+1}^2\right) W_h$ involves the derivative of $\tanh$, which lies in $(0, 1]$ and is typically well below 1.

The Vanishing Gradient Problem

When these Jacobians are repeatedly multiplied, the result shrinks exponentially. After as few as 10 steps, the gradient reaching the earliest tokens is essentially zero — early inputs have no influence on learning.

[Figure: Gradient flow through time, showing the relative gradient magnitude reaching each of the last 12 time steps. In a vanilla RNN each BPTT step multiplies the gradient by the Jacobian of $\tanh$ (≤ 0.42 in this example), so after 12 steps the gradient is essentially zero and the network cannot learn what happened early in the sequence.]
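A small numerical sketch of the effect, multiplying the step Jacobians $\mathrm{diag}(1 - h^2)\, W_h$ along a simulated trajectory (the weight rescaling and input noise are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W_h = rng.normal(size=(d, d))
W_h *= 1.0 / np.linalg.norm(W_h, 2)   # rescale so the decay comes only from tanh'

h = np.zeros(d)
grad = np.eye(d)                      # gradient "seed" at the final time step
for t in range(1, 13):
    h = np.tanh(W_h @ h + rng.normal(scale=1.0, size=d))   # forward step
    J = np.diag(1.0 - h**2) @ W_h                           # Jacobian dh_t / dh_{t-1}
    grad = grad @ J                                          # product of step Jacobians
    print(f"after {t:2d} steps: |grad| = {np.linalg.norm(grad, 2):.3e}")
```

Because every Jacobian factor has norm below 1, the product typically shrinks geometrically, dropping by orders of magnitude within a dozen steps.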

Exploding Gradients

Gradients can also explode if $\|W_h\|$ is large. This is handled in practice by gradient clipping: if $\|\nabla\| > \theta$, rescale the gradient to $\theta \cdot \nabla / \|\nabla\|$. Vanishing gradients are harder to fix because there is no signal to detect them — the gradient just silently disappears.
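A one-function sketch of that rule (the function name is illustrative):

```python
import numpy as np

def clip_gradient(grad, theta):
    """If the gradient norm exceeds theta, rescale it to theta * grad / ||grad||."""
    norm = np.linalg.norm(grad)
    return grad if norm <= theta else grad * (theta / norm)

g = np.array([3.0, 4.0])         # ||g|| = 5
print(clip_gradient(g, 1.0))     # [0.6 0.8]: direction preserved, norm capped at theta
```

Deep learning frameworks ship the same operation directly, for example `torch.nn.utils.clip_grad_norm_` in PyTorch.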


3. Long Short-Term Memory (LSTM)

Hochreiter & Schmidhuber (1997) proposed the solution: separate the "memory" from the "working state" and control what is written and read via learned gates.

The Cell State — A Gradient Highway

The key innovation is the cell state $c_t$, a vector that runs horizontally through time with only minor, linear interactions:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Because the update is an addition (not a nonlinear squash), gradients flow back through $c_t$ without attenuation as long as the forget gate $f_t \approx 1$. This is the "constant error carousel" that makes LSTM work.
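A back-of-the-envelope comparison of the two paths, using illustrative per-step factors: 0.97 for a mostly open forget gate, and the 0.42 tanh-Jacobian value from the gradient-flow figure above:

```python
import numpy as np

# Along the cell-state path, dc_t/dc_{t-1} = f_t (elementwise), so the gradient
# reaching a step k positions back along this path is the product of the forget
# gates in between, rather than a product of tanh Jacobians.
steps = 50
f_open   = np.full(steps, 0.97)   # forget gate kept close to 1
tanh_jac = np.full(steps, 0.42)   # illustrative tanh-Jacobian factor per step

print("LSTM cell-state path :", np.prod(f_open))     # ~0.22 after 50 steps
print("vanilla RNN path     :", np.prod(tanh_jac))   # ~1e-19 after 50 steps
```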

The Four Computations

All four computations take the same concatenated input $[h_{t-1},\, x_t]$:

Forget gate — decides what fraction of the old cell state to keep:

$$f_t = \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right)$$

Input gate — decides how much of the new candidate to write:

$$i_t = \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right)$$

Candidate cell state — the proposed new values:

$$\tilde{c}_t = \tanh\!\left(W_c [h_{t-1}, x_t] + b_c\right)$$

Output gate — filters what part of the cell state becomes the hidden state:

$$o_t = \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(c_t)$$
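Putting the four computations together, a minimal NumPy sketch of one LSTM step, with the four weight blocks stacked into a single matrix (names and shapes are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_{t-1}, x_t] to the four stacked
    pre-activations (forget, input, candidate, output); b is the stacked bias.

    x_t: (n,), h_prev / c_prev: (d,), W: (4*d, n+d), b: (4*d,)
    """
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0*d:1*d])            # forget gate: what fraction of c_{t-1} to keep
    i = sigmoid(z[1*d:2*d])            # input gate: how much candidate to write
    c_tilde = np.tanh(z[2*d:3*d])      # candidate cell state
    o = sigmoid(z[3*d:4*d])            # output gate: what to expose as h_t
    c = f * c_prev + i * c_tilde       # additive cell-state update
    h = o * np.tanh(c)                 # new hidden state
    return h, c

# Toy usage
rng = np.random.default_rng(0)
n, d = 8, 16
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=n), h, c,
                 rng.normal(scale=0.1, size=(4 * d, n + d)), np.zeros(4 * d))
print(h.shape, c.shape)
```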

Why σ for gates, tanh for values?

Sigmoid outputs are in $[0, 1]$ — perfect for a multiplicative gate (0 = closed, 1 = fully open). Tanh outputs are in $[-1, 1]$, encoding signed values. This is why candidate states use tanh: they represent changes that can increase or decrease the cell state.

Worked Example: One LSTM Step

With example inputs $h_{t-1} = 0.50$, $x_t = 0.60$, and previous cell state $c_{t-1} = 0.80$, the six computations from the interactive gate calculator work out to:

| Step | Formula | Value |
| --- | --- | --- |
| Forget gate | $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ | 0.546 (retains 55% of $c_{t-1} = 0.80$) |
| Input gate | $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ | 0.630 |
| Candidate cell | $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$ | 0.521 |
| New cell state | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ | 0.765 |
| Output gate | $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ | 0.613 |
| Hidden state | $h_t = o_t \odot \tanh(c_t)$ | 0.395 |

4. Gated Recurrent Unit (GRU)

Cho et al. (2014) proposed a simplified architecture that retains the gating mechanism but eliminates the separate cell state, reducing the number of parameters by ~25%.

Two Gates Instead of Four

Reset gate — controls how much past state influences the candidate:

$$r_t = \sigma\!\left(W_r [h_{t-1}, x_t]\right)$$

Update gate — interpolates between old state and new candidate (replaces both forget and input gates):

$$z_t = \sigma\!\left(W_z [h_{t-1}, x_t]\right)$$

Candidate hidden state — the reset gate moderates the role of $h_{t-1}$:

$$\tilde{h}_t = \tanh\!\left(W [r_t \odot h_{t-1},\; x_t]\right)$$

New hidden state — a convex combination controlled by $z_t$:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

When $z_t = 0$ the cell ignores the input and copies the old state — equivalent to a maximally active forget gate. When $z_t = 1$ it discards all prior context and adopts the candidate.
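A matching NumPy sketch of one GRU step (names are illustrative; biases are omitted to mirror the equations above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step using the three weight blocks from the equations above.

    Each W maps the concatenated (h, x) vector of size d + n to d outputs.
    """
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx)                                        # reset gate
    z = sigmoid(W_z @ hx)                                        # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                      # convex combination

# Toy usage
rng = np.random.default_rng(0)
n, d = 8, 16
h = gru_step(rng.normal(size=n), np.zeros(d),
             rng.normal(scale=0.1, size=(d, d + n)),
             rng.normal(scale=0.1, size=(d, d + n)),
             rng.normal(scale=0.1, size=(d, d + n)))
print(h.shape)
```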

Worked Example: One GRU Step

The GRU has two gates and no separate cell state, a streamlined design with capacity similar to LSTM's. With example inputs $h_{t-1} = 0.60$ and $x_t = 0.50$, the computations from the interactive cell calculator work out to:

| Step | Formula | Value |
| --- | --- | --- |
| Reset gate | $r_t = \sigma(W_r [h_{t-1}, x_t])$ | 0.571 (uses 57% of the previous hidden state when computing the candidate) |
| Update gate | $z_t = \sigma(W_z [h_{t-1}, x_t])$ | 0.643 (mix: 64% new candidate + 36% old state) |
| Candidate | $\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$ | 0.577 |
| New hidden state | $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ | 0.36 × 0.60 + 0.64 × 0.577 = 0.585 |

Key insight: The update gate $z_t$ replaces both the forget and input gates from the LSTM. When $z_t \approx 0$, the cell copies the old state unchanged. When $z_t \approx 1$, it replaces it with the new candidate entirely. The reset gate $r_t$ controls how much past context shapes the candidate.

GRU vs LSTM — When Does It Matter?

On most benchmarks, GRU and LSTM perform comparably. LSTM tends to win on tasks requiring very long-range memory or highly structured dependencies (e.g., parsing). GRU wins when training time matters or data is limited, because fewer parameters mean less overfitting. In practice, try GRU first and switch to LSTM if performance plateaus.


5. Comparison

RNN Family Comparison

Vanilla RNN

• Core update rule: $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$
• Gates: none
• Params: 1× (baseline)
• Memory span: ~5–10 steps
• Speed: ⚡⚡⚡ fastest
• Strengths: simplest architecture; fewest parameters; fast to train and deploy
• Weaknesses: vanishing gradient; can't learn long-range dependencies; rarely used in production
• Best for: very short sequences, demos, educational contexts

Parameter Count

For a hidden size $d$ and input size $n$, each gate or candidate computation is a weight block with $(n + d) \times d + d$ parameters (weights plus bias).

| Model | Weight blocks | Param count (approx.) |
| --- | --- | --- |
| RNN | 1 | $(n + d) \cdot d$ |
| GRU | 3 | $3 \cdot (n + d) \cdot d$ |
| LSTM | 4 | $4 \cdot (n + d) \cdot d$ |

This is why GRU (3× RNN) trains notably faster than LSTM (4× RNN) at the same hidden dimension.
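A quick way to sanity-check these counts (the input and hidden sizes below are arbitrary example values):

```python
def recurrent_params(n, d, blocks, bias=True):
    """Parameter count for one recurrent layer built from `blocks` weight blocks,
    each of shape (d, n + d) plus an optional bias of size d.
    blocks = 1 for a vanilla RNN, 3 for a GRU, 4 for an LSTM."""
    return blocks * ((n + d) * d + (d if bias else 0))

# Illustrative sizes: 300-dimensional inputs, hidden size 512.
n, d = 300, 512
for name, blocks in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(f"{name:4s}: {recurrent_params(n, d, blocks):,} parameters")
```

Real framework layers can differ slightly; PyTorch's recurrent modules, for instance, keep two bias vectors per block (`bias_ih` and `bias_hh`).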


6. Beyond RNNs

Transformers (Vaswani et al., 2017) have displaced RNNs for most sequence tasks. Instead of processing tokens one-by-one, Transformers attend to all positions in parallel — eliminating the sequential bottleneck and scaling to much larger models.

However, RNNs remain relevant in specific settings:

| Scenario | Why RNN wins |
| --- | --- |
| Streaming / online inference | Processes one token at a time, $O(1)$ memory per step |
| Very long sequences | Transformer attention is $O(n^2)$; RNNs are $O(n)$ |
| On-device / edge deployment | Small constant state, no KV-cache needed |
| Text-to-speech | WaveNet and WaveRNN still use dilated convolutions + recurrence |

State Space Models

Structured State Space Models (S4, Mamba) are a recent class of sequence models that combine the $O(n)$ inference cost of RNNs with the parallelizable training of convolutions. They can be viewed as a learned, structured generalization of the LSTM cell-state highway.


Summary

| Concept | Key idea |
| --- | --- |
| RNN hidden state | A fixed-size summary of all tokens seen so far, updated at each step |
| Weight sharing | Same $W_h$, $W_x$ used at every time step — generalizes to any length |
| Vanishing gradient | Repeated Jacobian multiplication shrinks gradients exponentially in vanilla RNNs |
| LSTM cell state | Additive update path that carries gradients across long time spans without attenuation |
| LSTM gates | Forget (what to erase), input + candidate (what to write), output (what to expose) |
| GRU update gate | Single gate that interpolates between "copy old state" and "replace with candidate" |
| GRU reset gate | Controls how much past context enters the candidate computation |
