
RNN, LSTM & GRU: Sequence Modeling

How recurrent networks learn from sequences — the hidden state, vanishing gradients, LSTM's memory cell, and GRU's streamlined gating — with interactive visualizations of each architecture.


Language, audio, time series, and video all share a property that image classification does not: order matters. The word "not" completely reverses the meaning of "good" in a sentence being classified for sentiment. A feedforward network has no way to express this — it sees each token in isolation, with no memory of what came before.

Recurrent Neural Networks (RNNs) solve this by threading a hidden state through the sequence, accumulating context as they process each element. LSTM and GRU extend this idea with gating mechanisms that overcome the fundamental weakness of vanilla RNNs: the vanishing gradient problem.


1. The Recurrent Neural Network

Architecture

An RNN processes a sequence $x_1, x_2, \ldots, x_T$ one element at a time. At each step $t$, it takes the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$:

$$h_t = \tanh\!\left(W_h \, h_{t-1} + W_x \, x_t + b\right)$$

The output at each step can be $y_t = W_y h_t + b_y$, or only the final hidden state $h_T$ may be used as a summary of the whole sequence.

A critical property: the weight matrices $W_h$, $W_x$, $W_y$ are shared across all time steps. The same transformation is applied at $t = 1$ and at $t = T$. This weight sharing lets the network generalize to sequences of any length and keeps the parameter count constant regardless of sequence length.
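In code, the entire forward pass is one loop that reuses a single set of weights at every step. A minimal NumPy sketch (the function name and shapes are illustrative, not taken from any particular library):

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b, h0=None):
    """Run a vanilla RNN over a sequence, reusing the same weights at every step.

    x_seq: (T, n) -- T time steps, input size n
    W_h: (d, d), W_x: (d, n), b: (d,) -- shared across all time steps
    Returns the list of hidden states h_1 ... h_T, each of shape (d,).
    """
    d = W_h.shape[0]
    h = np.zeros(d) if h0 is None else h0       # h_0 = 0: no prior context
    states = []
    for x_t in x_seq:                            # one step per sequence element
        h = np.tanh(W_h @ h + W_x @ x_t + b)     # h_t = tanh(W_h h_{t-1} + W_x x_t + b)
        states.append(h)
    return states

# Toy usage: 5 tokens with 8-dimensional embeddings, hidden size 16.
rng = np.random.default_rng(0)
n, d, T = 8, 16, 5
states = rnn_forward(rng.normal(size=(T, n)),
                     rng.normal(scale=0.1, size=(d, d)),
                     rng.normal(scale=0.1, size=(d, n)),
                     np.zeros(d))
print(len(states), states[-1].shape)   # 5 hidden states; h_T summarizes the sequence
```

The loop works for any sequence length because the same `W_h`, `W_x`, and `b` are applied at every iteration.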

Unrolling Through Time

"Unrolling" is the conceptual trick of drawing one copy of the RNN cell for each time step, connected by the hidden state:

[Figure: RNN unrolled through time over the input sequence "The cat sat on mat". Each cell applies the same shared weights; the hidden state $h_0, h_1, \ldots, h_4$ is passed from cell to cell, starting from $h_{-1} = 0$ (a zero vector, no prior context) and ending in the output $\hat{y}$.]

Each cell is identical — same weights, same computation — but receives a different context via $h_{t-1}$.

Many-to-Many, One-to-Many, Many-to-One

The same RNN architecture supports multiple task shapes:

| Mode | Input | Output | Example |
| --- | --- | --- | --- |
| Many-to-Many | Full sequence | Full sequence | Machine translation, POS tagging |
| Many-to-One | Full sequence | Single vector | Sentiment classification |
| One-to-Many | Single vector | Full sequence | Image captioning |

2. Backpropagation Through Time

Training an RNN uses Backpropagation Through Time (BPTT): the unrolled network is treated as a very deep feedforward network, and gradients are propagated backward through each time step all the way to $t = 1$.

The gradient of the loss $\mathcal{L}$ with respect to $W_h$ sums a contribution from every step $t$, and each contribution requires multiplying through a chain of step-to-step Jacobians:

$$\frac{\partial \mathcal{L}}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial h_t} \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}$$

Each factor $\frac{\partial h_{k+1}}{\partial h_k} = \mathrm{diag}\!\left(1 - h_{k+1}^2\right) W_h$ involves the derivative of $\tanh$, which lies in $(0, 1]$ and is typically well below 1.

The Vanishing Gradient Problem

When these Jacobians are repeatedly multiplied, the result shrinks exponentially. After as few as 10 steps, the gradient reaching the earliest tokens is essentially zero — early inputs have no influence on learning.

[Figure: Gradient flow through time, showing the relative gradient magnitude reaching each of the last 12 time steps. In a vanilla RNN each BPTT step multiplies the gradient by the Jacobian of $\tanh$ (≤ 0.42 in this example), so after 12 steps the gradient is essentially zero and the network cannot learn what happened early in the sequence.]
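A small numerical sketch of the effect, multiplying the step Jacobians $\mathrm{diag}(1 - h^2)\, W_h$ along a simulated trajectory (the weight rescaling and input noise are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W_h = rng.normal(size=(d, d))
W_h *= 1.0 / np.linalg.norm(W_h, 2)   # rescale so the decay comes only from tanh'

h = np.zeros(d)
grad = np.eye(d)                      # gradient "seed" at the final time step
for t in range(1, 13):
    h = np.tanh(W_h @ h + rng.normal(scale=1.0, size=d))   # forward step
    J = np.diag(1.0 - h**2) @ W_h                           # Jacobian dh_t / dh_{t-1}
    grad = grad @ J                                          # product of step Jacobians
    print(f"after {t:2d} steps: |grad| = {np.linalg.norm(grad, 2):.3e}")
```

Because every Jacobian factor has norm below 1, the product typically shrinks geometrically, dropping by orders of magnitude within a dozen steps.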

Exploding Gradients

Gradients can also explode if $\|W_h\|$ is large. This is handled in practice by gradient clipping: if $\|\nabla\| > \theta$, rescale the gradient to $\theta \cdot \nabla / \|\nabla\|$. Vanishing gradients are harder to fix because there is no signal to detect them — the gradient just silently disappears.
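A one-function sketch of that rule (the function name is illustrative):

```python
import numpy as np

def clip_gradient(grad, theta):
    """If the gradient norm exceeds theta, rescale it to theta * grad / ||grad||."""
    norm = np.linalg.norm(grad)
    return grad if norm <= theta else grad * (theta / norm)

g = np.array([3.0, 4.0])         # ||g|| = 5
print(clip_gradient(g, 1.0))     # [0.6 0.8]: direction preserved, norm capped at theta
```

Deep learning frameworks ship the same operation directly, for example `torch.nn.utils.clip_grad_norm_` in PyTorch.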


3. Long Short-Term Memory (LSTM)

Hochreiter & Schmidhuber (1997) proposed the solution: separate the "memory" from the "working state" and control what is written and read via learned gates.

The Cell State — A Gradient Highway

The key innovation is the cell state $c_t$, a vector that runs horizontally through time with only minor, linear interactions:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Because the update is an addition (not a nonlinear squash), gradients flow back through $c_t$ without attenuation as long as the forget gate $f_t \approx 1$. This is the "constant error carousel" that makes LSTM work.
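A back-of-the-envelope comparison of the two paths, using illustrative per-step factors: 0.97 for a mostly open forget gate, and the 0.42 tanh-Jacobian value from the gradient-flow figure above:

```python
import numpy as np

# Along the cell-state path, dc_t/dc_{t-1} = f_t (elementwise), so the gradient
# reaching a step k positions back along this path is the product of the forget
# gates in between, rather than a product of tanh Jacobians.
steps = 50
f_open   = np.full(steps, 0.97)   # forget gate kept close to 1
tanh_jac = np.full(steps, 0.42)   # illustrative tanh-Jacobian factor per step

print("LSTM cell-state path :", np.prod(f_open))     # ~0.22 after 50 steps
print("vanilla RNN path     :", np.prod(tanh_jac))   # ~1e-19 after 50 steps
```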

The Four Computations

All four computations take the same concatenated input $[h_{t-1},\, x_t]$:

Forget gate — decides what fraction of the old cell state to keep:

$$f_t = \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right)$$

Input gate — decides how much of the new candidate to write:

$$i_t = \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right)$$

Candidate cell state — the proposed new values:

$$\tilde{c}_t = \tanh\!\left(W_c [h_{t-1}, x_t] + b_c\right)$$

Output gate — filters what part of the cell state becomes the hidden state:

$$o_t = \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(c_t)$$
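Putting the four computations together, a minimal NumPy sketch of one LSTM step, with the four weight blocks stacked into a single matrix (names and shapes are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_{t-1}, x_t] to the four stacked
    pre-activations (forget, input, candidate, output); b is the stacked bias.

    x_t: (n,), h_prev / c_prev: (d,), W: (4*d, n+d), b: (4*d,)
    """
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0*d:1*d])            # forget gate: what fraction of c_{t-1} to keep
    i = sigmoid(z[1*d:2*d])            # input gate: how much candidate to write
    c_tilde = np.tanh(z[2*d:3*d])      # candidate cell state
    o = sigmoid(z[3*d:4*d])            # output gate: what to expose as h_t
    c = f * c_prev + i * c_tilde       # additive cell-state update
    h = o * np.tanh(c)                 # new hidden state
    return h, c

# Toy usage
rng = np.random.default_rng(0)
n, d = 8, 16
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=n), h, c,
                 rng.normal(scale=0.1, size=(4 * d, n + d)), np.zeros(4 * d))
print(h.shape, c.shape)
```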

Why σ for gates, tanh for values?

Sigmoid outputs are in $[0, 1]$ — perfect for a multiplicative gate (0 = closed, 1 = fully open). Tanh outputs are in $[-1, 1]$, encoding signed values. This is why candidate states use tanh: they represent changes that can increase or decrease the cell state.

Worked Example: One LSTM Step

With example inputs $h_{t-1} = 0.50$, $x_t = 0.60$, and previous cell state $c_{t-1} = 0.80$, the six computations from the interactive gate calculator work out to:

| Step | Formula | Value |
| --- | --- | --- |
| Forget gate | $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ | 0.546 (retains 55% of $c_{t-1} = 0.80$) |
| Input gate | $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ | 0.630 |
| Candidate cell | $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$ | 0.521 |
| New cell state | $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ | 0.765 |
| Output gate | $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ | 0.613 |
| Hidden state | $h_t = o_t \odot \tanh(c_t)$ | 0.395 |

4. Gated Recurrent Unit (GRU)

Cho et al. (2014) proposed a simplified architecture that retains the gating mechanism but eliminates the separate cell state, reducing the number of parameters by ~25%.

Two Gates Instead of Four

Reset gate — controls how much past state influences the candidate:

$$r_t = \sigma\!\left(W_r [h_{t-1}, x_t]\right)$$

Update gate — interpolates between old state and new candidate (replaces both forget and input gates):

$$z_t = \sigma\!\left(W_z [h_{t-1}, x_t]\right)$$

Candidate hidden state — the reset gate moderates the role of $h_{t-1}$:

$$\tilde{h}_t = \tanh\!\left(W [r_t \odot h_{t-1},\; x_t]\right)$$

New hidden state — a convex combination controlled by $z_t$:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

When $z_t = 0$ the cell ignores the input and copies the old state — equivalent to a maximally active forget gate. When $z_t = 1$ it discards all prior context and adopts the candidate.
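A matching NumPy sketch of one GRU step (names are illustrative; biases are omitted to mirror the equations above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step using the three weight blocks from the equations above.

    Each W maps the concatenated (h, x) vector of size d + n to d outputs.
    """
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx)                                        # reset gate
    z = sigmoid(W_z @ hx)                                        # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                      # convex combination

# Toy usage
rng = np.random.default_rng(0)
n, d = 8, 16
h = gru_step(rng.normal(size=n), np.zeros(d),
             rng.normal(scale=0.1, size=(d, d + n)),
             rng.normal(scale=0.1, size=(d, d + n)),
             rng.normal(scale=0.1, size=(d, d + n)))
print(h.shape)
```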

Worked Example: One GRU Step

The GRU has two gates and no separate cell state, a streamlined design with capacity similar to LSTM's. With example inputs $h_{t-1} = 0.60$ and $x_t = 0.50$, the computations from the interactive cell calculator work out to:

| Step | Formula | Value |
| --- | --- | --- |
| Reset gate | $r_t = \sigma(W_r [h_{t-1}, x_t])$ | 0.571 (uses 57% of the previous hidden state when computing the candidate) |
| Update gate | $z_t = \sigma(W_z [h_{t-1}, x_t])$ | 0.643 (mix: 64% new candidate + 36% old state) |
| Candidate | $\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$ | 0.577 |
| New hidden state | $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ | 0.36 × 0.60 + 0.64 × 0.577 = 0.585 |

Key insight: The update gate $z_t$ replaces both the forget and input gates from the LSTM. When $z_t \approx 0$, the cell copies the old state unchanged. When $z_t \approx 1$, it replaces it with the new candidate entirely. The reset gate $r_t$ controls how much past context shapes the candidate.

GRU vs LSTM — When Does It Matter?

On most benchmarks, GRU and LSTM perform comparably. LSTM tends to win on tasks requiring very long-range memory or highly structured dependencies (e.g., parsing). GRU wins when training time matters or data is limited, because fewer parameters mean less overfitting. In practice, try GRU first and switch to LSTM if performance plateaus.


5. Comparison

RNN Family Comparison

Vanilla RNN

• Core update rule: $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$
• Gates: none
• Params: 1× (baseline)
• Memory span: ~5–10 steps
• Speed: ⚡⚡⚡ fastest
• Strengths: simplest architecture; fewest parameters; fast to train and deploy
• Weaknesses: vanishing gradient; can't learn long-range dependencies; rarely used in production
• Best for: very short sequences, demos, educational contexts

Parameter Count

For a hidden size $d$ and input size $n$, each gate or candidate computation is a weight block with $(n + d) \times d + d$ parameters (weights plus bias).

| Model | Weight blocks | Param count (approx.) |
| --- | --- | --- |
| RNN | 1 | $(n + d) \cdot d$ |
| GRU | 3 | $3 \cdot (n + d) \cdot d$ |
| LSTM | 4 | $4 \cdot (n + d) \cdot d$ |

This is why GRU (3× RNN) trains notably faster than LSTM (4× RNN) at the same hidden dimension.
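A quick way to sanity-check these counts (the input and hidden sizes below are arbitrary example values):

```python
def recurrent_params(n, d, blocks, bias=True):
    """Parameter count for one recurrent layer built from `blocks` weight blocks,
    each of shape (d, n + d) plus an optional bias of size d.
    blocks = 1 for a vanilla RNN, 3 for a GRU, 4 for an LSTM."""
    return blocks * ((n + d) * d + (d if bias else 0))

# Illustrative sizes: 300-dimensional inputs, hidden size 512.
n, d = 300, 512
for name, blocks in [("RNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(f"{name:4s}: {recurrent_params(n, d, blocks):,} parameters")
```

Real framework layers can differ slightly; PyTorch's recurrent modules, for instance, keep two bias vectors per block (`bias_ih` and `bias_hh`).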


6. Beyond RNNs

Transformers (Vaswani et al., 2017) have displaced RNNs for most sequence tasks. Instead of processing tokens one-by-one, Transformers attend to all positions in parallel — eliminating the sequential bottleneck and scaling to much larger models.

However, RNNs remain relevant in specific settings:

| Scenario | Why RNN wins |
| --- | --- |
| Streaming / online inference | Processes one token at a time, $O(1)$ memory per step |
| Very long sequences | Transformer attention is $O(n^2)$; RNNs are $O(n)$ |
| On-device / edge deployment | Small constant state, no KV-cache needed |
| Text-to-speech | WaveNet and WaveRNN still use dilated convolutions + recurrence |

State Space Models

Structured State Space Models (S4, Mamba) are a recent class of sequence models that combine the $O(n)$ inference cost of RNNs with the parallelizable training of convolutions. They can be viewed as a learned, structured generalization of the LSTM cell-state highway.


Summary

| Concept | Key idea |
| --- | --- |
| RNN hidden state | A fixed-size summary of all tokens seen so far, updated at each step |
| Weight sharing | Same $W_h$, $W_x$ used at every time step — generalizes to any length |
| Vanishing gradient | Repeated Jacobian multiplication shrinks gradients exponentially in vanilla RNNs |
| LSTM cell state | Additive update path that carries gradients across long time spans without attenuation |
| LSTM gates | Forget (what to erase), input + candidate (what to write), output (what to expose) |
| GRU update gate | Single gate that interpolates between "copy old state" and "replace with candidate" |
| GRU reset gate | Controls how much past context enters the candidate computation |
