
Understanding Optimizers in Deep Learning

A comprehensive guide to gradient descent variants and optimization algorithms used in training deep neural networks.

The Role of Optimizers

At the heart of deep learning is the concept of minimizing a loss function, L(θ), where θ represents the model's parameters (weights and biases). Optimizers dictate how these parameters are updated based on the computed gradients.

They guide the neural network through the high-dimensional loss landscape to find a global (or good local) minimum.

Gradient Descent: The Foundation

The foundational algorithm is Gradient Descent. At each step t, the parameters are updated in the opposite direction of the gradient of the loss function.

\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L(\theta_t)

Where:

  • θ_t: Parameters at step t
  • η: Learning rate (step size)
  • ∇_θ L: Gradient of the loss function
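
To make the update rule concrete, here is a minimal NumPy sketch of plain gradient descent on a toy quadratic loss; the loss, target values, and learning rate are illustrative assumptions, not taken from any particular framework.

```python
import numpy as np

# Toy quadratic loss: L(theta) = ||theta - target||^2  (illustrative assumption)
target = np.array([3.0, -2.0])

def grad_L(theta):
    # Gradient of the toy loss: dL/dtheta = 2 * (theta - target)
    return 2.0 * (theta - target)

theta = np.zeros(2)   # initial parameters theta_0
eta = 0.1             # learning rate

for t in range(100):
    theta = theta - eta * grad_L(theta)   # theta_{t+1} = theta_t - eta * grad

print(theta)          # converges toward [3.0, -2.0]
```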

Variants of Gradient Descent

The amount of data used to compute the gradient ∇_θ L at each update defines the variant.

1. Batch Gradient Descent

Computes the gradient over the entire training dataset.

  • Pros: Guaranteed convergence to the global minimum for convex error surfaces.
  • Cons: Extremely slow and often intractable for large datasets that do not fit in memory.

2. Stochastic Gradient Descent (SGD)

Computes the gradient and updates parameters for each individual training example.

  • Pros: Frequent updates allow for faster progress; introduces noise that can help escape shallow local minima.
  • Cons: High variance in updates causes the loss trajectory to fluctuate heavily.

3. Mini-Batch Gradient Descent

The sweet spot. Updates are performed on small batches of data (e.g., 32, 64, or 256 samples).

  • Pros: Reduces the variance of parameter updates, leading to more stable convergence, while taking advantage of highly optimized matrix operations.

Terminology Note

In modern deep learning literature, "SGD" almost always refers to Mini-Batch Gradient Descent.
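
As a sketch of how mini-batch updates are typically organized, the loop below shuffles a hypothetical linear-regression dataset each epoch and takes one step per batch; the data, model, and hyperparameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical linear-regression data: 1,000 examples, 5 features
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)
eta, batch_size = 0.01, 64

for epoch in range(10):
    perm = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # Gradient of mean-squared error on this mini-batch only
        grad = 2.0 * xb.T @ (xb @ theta - yb) / len(xb)
        theta -= eta * grad                      # one mini-batch SGD step
```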


Momentum-Based Optimizers

Standard SGD struggles with ravines—areas where the loss surface curves much more steeply in one dimension than in another. Momentum helps accelerate SGD in the relevant direction and dampens oscillations.

Classical Momentum

Mimics a ball rolling down a hill, gaining speed as it goes. It accumulates an exponentially decaying average of past gradients.

v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta_t)
\theta_{t+1} = \theta_t - v_t

  • v_t: Velocity vector
  • γ: Momentum term (typically 0.9)
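
A minimal sketch of the classical momentum update, assuming a toy quadratic loss and hand-picked η and γ:

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    """One classical momentum update: accumulate velocity, then step."""
    v = gamma * v + eta * grad      # v_t = gamma * v_{t-1} + eta * grad
    return theta - v, v             # theta_{t+1} = theta_t - v_t

theta, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    grad = 2.0 * theta              # gradient of the toy loss ||theta||^2
    theta, v = momentum_step(theta, v, grad)
```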

Adaptive Learning Rate Optimizers

Choosing a global learning rate η is challenging. Adaptive optimizers maintain a per-parameter learning rate, which adapts based on the historical gradients for that specific parameter.

AdaGrad

Adapts the learning rate per parameter, scaling it in inverse proportion to the square root of the sum of all historical squared gradients.

g_{t,i} = \nabla_\theta L(\theta_{t,i})
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}

  • G_t: A diagonal matrix whose entry G_{t,ii} accumulates the sum of squared gradients for parameter θ_i up to step t
  • Issue: The accumulated sum G_t continuously grows, causing the effective learning rate to shrink toward zero and stopping learning prematurely.
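
A minimal per-parameter AdaGrad sketch in NumPy (the toy loss and hyperparameters are assumptions for illustration); note how G only ever grows:

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.1, eps=1e-8):
    """One AdaGrad update; G accumulates the squared gradients per parameter."""
    G = G + grad ** 2                                  # lifetime sum, never shrinks
    theta = theta - eta / np.sqrt(G + eps) * grad      # per-parameter scaled step
    return theta, G

theta, G = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    grad = 2.0 * theta              # gradient of the toy loss ||theta||^2
    theta, G = adagrad_step(theta, G, grad)
```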

RMSProp

Proposed by Geoff Hinton to resolve AdaGrad's diminishing learning rate. It uses an exponentially decaying average of squared gradients rather than a cumulative sum.

E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t
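
The same kind of sketch adapted to RMSProp, where the cumulative sum is replaced by an exponentially decaying average (β and the toy setup are illustrative assumptions):

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update using a decaying average of squared gradients."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2    # E[g^2]_t
    theta = theta - eta / np.sqrt(avg_sq + eps) * grad
    return theta, avg_sq

theta, avg_sq = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(500):
    grad = 2.0 * theta              # gradient of the toy loss ||theta||^2
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad)
```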

Adam: Adaptive Moment Estimation

Adam combines the best of both worlds: the velocity of Momentum and the adaptive per-parameter learning rate of RMSProp.

It maintains two exponentially decaying moving averages:

  1. First Moment (m_t): The mean of past gradients (like Momentum).
  2. Second Moment (v_t): The uncentered variance of past gradients (like RMSProp).

The Adam Algorithm

Update Biased First Moment Estimate

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

Update Biased Second Raw Moment Estimate

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

Compute Bias-Corrected Estimates

Because m_t and v_t are initialized as vectors of zeros, they are biased toward zero during the early steps. Adam corrects this:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Update Parameters

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t

Defaults

Default hyperparameters are usually β_1 = 0.9, β_2 = 0.999, and ε = 10⁻⁸. These work exceptionally well across a wide variety of tasks.
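
Putting the four steps together, here is a compact NumPy sketch of one Adam update plus a toy training loop; the quadratic loss and step count are illustrative assumptions, and the step index t must start at 1 for the bias correction to be well defined.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad             # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2        # second raw moment
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([5.0, 5.0]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    grad = 2.0 * theta                             # gradient of the toy loss ||theta||^2
    theta, m, v = adam_step(theta, m, v, grad, t)
```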

AdamW (Adam with Weight Decay)

A fix to Adam's handling of weight decay (L2 regularization). In standard Adam, the L2 penalty is added to the gradient, so the regularization term is also rescaled by the adaptive term v_t. AdamW decouples the two: weight decay is subtracted directly from the parameters during the update step, which often yields noticeably better generalization.
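
A minimal sketch of the decoupled update, assuming the same toy setup as the Adam sketch above; the only change is where the weight-decay term enters the update.

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Adam step with decoupled weight decay (AdamW-style sketch)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The decay term is applied directly to the parameters, not folded into
    # `grad`, so it is not rescaled by the adaptive denominator sqrt(v_hat).
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

In practice this corresponds to using a dedicated AdamW optimizer (e.g., torch.optim.AdamW) rather than adding an L2 term to the loss.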


Interactive Playground

[Interactive 3D visualization: choose a loss surface and an optimizer, adjust the learning rate, and watch the step count and loss evolve.]

Summary of Optimizers

| Optimizer | Mechanism | Key Advantage | Typical Use Case |
| --- | --- | --- | --- |
| SGD | Gradient step | Simple, generalizes well | Baseline, fine-tuning |
| SGD + Momentum | Running avg of gradients | Escapes local minima, less oscillation | Computer vision (ResNets) |
| RMSProp | Running avg of squared gradients | Adaptive per-parameter learning rates | RNNs, LSTMs |
| Adam | Combines Momentum & RMSProp | Fast convergence, requires less tuning | Default for most DL tasks |
| AdamW | Adam + decoupled weight decay | Better generalization than standard Adam | Transformers, LLMs |
