
Understanding Optimizers in Deep Learning

A comprehensive guide to gradient descent variants and optimization algorithms used in training deep neural networks.

The Role of Optimizers

At the heart of deep learning is the concept of minimizing a loss function, L(θ), where θ represents the model's parameters (weights and biases). Optimizers dictate how these parameters are updated based on the computed gradients.

They guide the neural network through the high-dimensional loss landscape to find a global (or good local) minimum.

Gradient Descent: The Foundation

The foundational algorithm is Gradient Descent. At each step t, the parameters are updated in the opposite direction of the gradient of the loss function.

\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta L(\theta_t)

Where:

  • θ_t: Parameters at step t
  • η: Learning rate (step size)
  • ∇_θ L: Gradient of the loss function
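
To make the update rule concrete, here is a minimal NumPy sketch of plain gradient descent on a toy quadratic loss; the loss, target values, and learning rate are illustrative assumptions, not taken from any particular framework.

```python
import numpy as np

# Toy quadratic loss: L(theta) = ||theta - target||^2  (illustrative assumption)
target = np.array([3.0, -2.0])

def grad_L(theta):
    # Gradient of the toy loss: dL/dtheta = 2 * (theta - target)
    return 2.0 * (theta - target)

theta = np.zeros(2)   # initial parameters theta_0
eta = 0.1             # learning rate

for t in range(100):
    theta = theta - eta * grad_L(theta)   # theta_{t+1} = theta_t - eta * grad

print(theta)          # converges toward [3.0, -2.0]
```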

Variants of Gradient Descent

The amount of data used to compute the gradient ∇_θ L at each update defines the variant.

1. Batch Gradient Descent

Computes the gradient over the entire training dataset.

  • Pros: Guaranteed convergence to the global minimum for convex error surfaces.
  • Cons: Extremely slow and often intractable for large datasets that do not fit in memory.

2. Stochastic Gradient Descent (SGD)

Computes the gradient and updates parameters for each individual training example.

  • Pros: Frequent updates allow for faster progress; introduces noise that can help escape shallow local minima.
  • Cons: High variance in updates causes the loss trajectory to fluctuate heavily.

3. Mini-Batch Gradient Descent

The sweet spot. Updates are performed on small batches of data (e.g., 32, 64, or 256 samples).

  • Pros: Reduces the variance of parameter updates, leading to more stable convergence, while taking advantage of highly optimized matrix operations.

Terminology Note

In modern deep learning literature, "SGD" almost always refers to Mini-Batch Gradient Descent.
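
As a sketch of how mini-batch updates are typically organized, the loop below shuffles a hypothetical linear-regression dataset each epoch and takes one step per batch; the data, model, and hyperparameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical linear-regression data: 1,000 examples, 5 features
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)
eta, batch_size = 0.01, 64

for epoch in range(10):
    perm = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # Gradient of mean-squared error on this mini-batch only
        grad = 2.0 * xb.T @ (xb @ theta - yb) / len(xb)
        theta -= eta * grad                      # one mini-batch SGD step
```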


Momentum-Based Optimizers

Standard SGD struggles with ravines—areas where the loss surface curves much more steeply in one dimension than in another. Momentum helps accelerate SGD in the relevant direction and dampens oscillations.

Classical Momentum

Mimics a ball rolling down a hill, gaining speed as it goes. It accumulates an exponentially decaying average of past gradients.

v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta_t)
\theta_{t+1} = \theta_t - v_t

  • v_t: Velocity vector
  • γ: Momentum term (typically 0.9)
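
A minimal sketch of the classical momentum update, assuming a toy quadratic loss and hand-picked η and γ:

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, gamma=0.9):
    """One classical momentum update: accumulate velocity, then step."""
    v = gamma * v + eta * grad      # v_t = gamma * v_{t-1} + eta * grad
    return theta - v, v             # theta_{t+1} = theta_t - v_t

theta, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    grad = 2.0 * theta              # gradient of the toy loss ||theta||^2
    theta, v = momentum_step(theta, v, grad)
```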

Adaptive Learning Rate Optimizers

Choosing a global learning rate η is challenging. Adaptive optimizers maintain a per-parameter learning rate, which adapts based on the historical gradients for that specific parameter.

AdaGrad

Adapts the learning rate per parameter, scaling it in inverse proportion to the square root of the sum of all historical squared gradients.

g_{t,i} = \nabla_\theta L(\theta_{t,i})
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}

  • G_t: A diagonal matrix whose entry G_{t,ii} accumulates the sum of squared gradients for parameter θ_i up to step t
  • Issue: The accumulated sum G_t continuously grows, causing the effective learning rate to shrink toward zero and stopping learning prematurely.
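
A minimal per-parameter AdaGrad sketch in NumPy (the toy loss and hyperparameters are assumptions for illustration); note how G only ever grows:

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.1, eps=1e-8):
    """One AdaGrad update; G accumulates the squared gradients per parameter."""
    G = G + grad ** 2                                  # lifetime sum, never shrinks
    theta = theta - eta / np.sqrt(G + eps) * grad      # per-parameter scaled step
    return theta, G

theta, G = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    grad = 2.0 * theta              # gradient of the toy loss ||theta||^2
    theta, G = adagrad_step(theta, G, grad)
```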

RMSProp

Proposed by Geoff Hinton to resolve AdaGrad's diminishing learning rate. It uses an exponentially decaying average of squared gradients rather than a cumulative sum.

E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t
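
The same kind of sketch adapted to RMSProp, where the cumulative sum is replaced by an exponentially decaying average (β and the toy setup are illustrative assumptions):

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update using a decaying average of squared gradients."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2    # E[g^2]_t
    theta = theta - eta / np.sqrt(avg_sq + eps) * grad
    return theta, avg_sq

theta, avg_sq = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(500):
    grad = 2.0 * theta              # gradient of the toy loss ||theta||^2
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad)
```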

Adam: Adaptive Moment Estimation

Adam combines the best of both worlds: the velocity of Momentum and the adaptive per-parameter learning rate of RMSProp.

It maintains two exponentially decaying moving averages:

  1. First Moment (m_t): The mean of past gradients (like Momentum).
  2. Second Moment (v_t): The uncentered variance of past gradients (like RMSProp).

The Adam Algorithm

Update Biased First Moment Estimate

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

Update Biased Second Raw Moment Estimate

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

Compute Bias-Corrected Estimates

Because m_t and v_t are initialized as vectors of zeros, they are biased toward zero during the early steps. Adam corrects this:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Update Parameters

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t

Defaults

Default hyperparameters are usually β_1 = 0.9, β_2 = 0.999, and ε = 10⁻⁸. These work exceptionally well across a wide variety of tasks.
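
Putting the four steps together, here is a compact NumPy sketch of one Adam update plus a toy training loop; the quadratic loss and step count are illustrative assumptions, and the step index t must start at 1 for the bias correction to be well defined.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad             # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2        # second raw moment
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([5.0, 5.0]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    grad = 2.0 * theta                             # gradient of the toy loss ||theta||^2
    theta, m, v = adam_step(theta, m, v, grad, t)
```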

AdamW (Adam with Weight Decay)

A fix to Adam's handling of weight decay (L2 regularization). In standard Adam, the L2 penalty is added to the gradient, so the regularization term is also rescaled by the adaptive term v_t. AdamW decouples the two: weight decay is subtracted directly from the parameters during the update step, which often yields noticeably better generalization.
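
A minimal sketch of the decoupled update, assuming the same toy setup as the Adam sketch above; the only change is where the weight-decay term enters the update.

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Adam step with decoupled weight decay (AdamW-style sketch)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The decay term is applied directly to the parameters, not folded into
    # `grad`, so it is not rescaled by the adaptive denominator sqrt(v_hat).
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

In practice this corresponds to using a dedicated AdamW optimizer (e.g., torch.optim.AdamW) rather than adding an L2 term to the loss.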


Interactive Playground

[Interactive 3D visualization: choose a loss surface and an optimizer, adjust the learning rate, and watch the step count and loss evolve.]

Summary of Optimizers

| Optimizer | Mechanism | Key Advantage | Typical Use Case |
| --- | --- | --- | --- |
| SGD | Gradient step | Simple, generalizes well | Baseline, fine-tuning |
| SGD + Momentum | Running avg of gradients | Escapes local minima, less oscillation | Computer vision (ResNets) |
| RMSProp | Running avg of squared gradients | Adaptive per-parameter learning rates | RNNs, LSTMs |
| Adam | Combines Momentum & RMSProp | Fast convergence, requires less tuning | Default for most DL tasks |
| AdamW | Adam + decoupled weight decay | Better generalization than standard Adam | Transformers, LLMs |
