Understanding Optimizers in Deep Learning
A comprehensive guide to gradient descent variants and optimization algorithms used in training deep neural networks.
The Role of Optimizers
At the heart of deep learning is the concept of minimizing a Loss Function $J(\theta)$, where $\theta$ represents the model's parameters (weights and biases). Optimizers dictate how these parameters are updated based on the computed gradients.
They guide the neural network through the high-dimensional loss landscape to find a global (or good local) minimum.
Gradient Descent: The Foundation
The foundational algorithm is Gradient Descent. At each step $t$, the parameters are updated in the opposite direction of the gradient of the loss function:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t)$$

Where:
- $\theta_t$: Parameters at step $t$
- $\eta$: Learning rate (step size)
- $\nabla_\theta J(\theta_t)$: Gradient of the loss function
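A minimal sketch of this update rule in NumPy; the toy quadratic loss, its gradient, and the learning rate of 0.1 are illustrative assumptions, not values from the article:

```python
import numpy as np

def gradient_descent_step(theta, grad, lr=0.1):
    """One vanilla gradient descent update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -2.0])
for _ in range(100):
    grad = 2 * theta                      # gradient of the quadratic loss
    theta = gradient_descent_step(theta, grad)
print(theta)                              # close to the minimum at [0, 0]
```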
Variants of Gradient Descent
The amount of data used to compute the gradient $\nabla_\theta J(\theta)$ defines the variant.
1. Batch Gradient Descent
Computes the gradient over the entire training dataset.
- Pros: Guaranteed convergence to the global minimum for convex error surfaces.
- Cons: Extremely slow for large datasets and intractable when the full dataset does not fit in memory.
2. Stochastic Gradient Descent (SGD)
Computes the gradient and updates parameters for each individual training example.
- Pros: Frequent updates allow for faster progress; introduces noise that can help escape shallow local minima.
- Cons: High variance in updates causes the loss trajectory to fluctuate heavily.
3. Mini-Batch Gradient Descent
The sweet spot. Updates are performed on small batches of data (e.g., 32, 64, or 256 samples).
- Pros: Reduces variance of parameter updates, leading to more stable convergence, whilst utilizing highly optimized matrix operations.
Terminology Note
In modern deep learning literature, "SGD" almost always refers to Mini-Batch Gradient Descent.
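To make the variants concrete, here is a hedged sketch of a mini-batch training loop for linear regression; the synthetic data, batch size of 64, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # toy inputs
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w, lr, batch_size = np.zeros(5), 0.1, 64
for epoch in range(20):
    perm = rng.permutation(len(X))        # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on the mini-batch
        w -= lr * grad                    # same update rule, noisier gradient
print(np.round(w - true_w, 3))            # residual error, near zero
```

Setting `batch_size = len(X)` recovers Batch Gradient Descent, and `batch_size = 1` recovers per-example SGD.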
Momentum-Based Optimizers
Standard SGD struggles with ravines—areas where the loss surface curves much more steeply in one dimension than in another. Momentum helps accelerate SGD in the relevant direction and dampens oscillations.
Classical Momentum
Mimics a ball rolling down a hill, gaining speed. It accumulates a moving average of past gradients:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta_t)$$
$$\theta_{t+1} = \theta_t - v_t$$

- $v_t$: Velocity vector
- $\gamma$: Momentum term (typically 0.9)
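A sketch of the classical momentum update written above; the toy quadratic loss and the learning rate are illustrative assumptions:

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, gamma=0.9):
    """v <- gamma * v + lr * grad;  theta <- theta - v."""
    velocity = gamma * velocity + lr * grad
    theta = theta - velocity
    return theta, velocity

theta = np.array([3.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    grad = 2 * theta                      # same toy quadratic loss as before
    theta, velocity = momentum_step(theta, grad, velocity)
```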
Adaptive Learning Rate Optimizers
Choosing a global learning rate is challenging. Adaptive optimizers maintain a per-parameter learning rate, which adapts based on the historical gradients for that specific parameter.
AdaGrad
Adapts the learning rate for each parameter by scaling it in inverse proportion to the square root of the sum of all historical squared gradients.
- Issue: The accumulated sum continuously grows, causing the learning rate to shrink to zero, stopping learning prematurely.
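A sketch of the AdaGrad per-parameter scaling; the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """Accumulate squared gradients and scale each step by their square root."""
    accum = accum + grad ** 2             # grows monotonically, so steps keep shrinking
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```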
RMSProp
Proposed by Geoff Hinton to resolve AdaGrad's diminishing learning rate. It uses an exponentially decaying average of squared gradients rather than a cumulative sum.
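Compared with the AdaGrad sketch above, only the accumulator changes: an exponentially decaying average replaces the running sum. The decay rate of 0.9 shown here is a common choice, stated as an assumption:

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """Decaying average of squared gradients replaces AdaGrad's cumulative sum."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq
```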
Adam: Adaptive Moment Estimation
Adam combines the best of both worlds: the velocity of Momentum and the adaptive per-parameter learning rate of RMSProp.
It maintains two exponentially decaying moving averages:
- First Moment ($m_t$): The mean of past gradients (like Momentum).
- Second Moment ($v_t$): The uncentered variance of past gradients (like RMSProp).
The Adam Algorithm
Update Biased First Moment Estimate

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

where $g_t = \nabla_\theta J(\theta_t)$ is the gradient at step $t$.

Update Biased Second Raw Moment Estimate

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Compute Bias-Corrected Estimates

Because $m_t$ and $v_t$ are initialized as vectors of 0's, they are biased toward zero. Adam corrects this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Update Parameters

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$
Defaults
Default hyperparameters are usually $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. These work exceptionally well across a wide variety of tasks.
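Putting the four steps together, a minimal Adam implementation using the default hyperparameters above; the toy quadratic loss and the larger learning rate in the loop are illustrative assumptions:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([3.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):                          # t starts at 1 for bias correction
    grad = 2 * theta                              # toy quadratic loss
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```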
AdamW (Adam with Weight Decay)
A fix to Adam's handling of weight decay (L2 regularization). In standard Adam, the L2 penalty is added to the gradient, so the regularization term gets rescaled by the adaptive denominator $\sqrt{\hat{v}_t}$. AdamW instead decouples the weight decay from the gradient-based update, subtracting it directly from the parameters, which leads to noticeably better generalization in practice.
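A sketch of the decoupled update, reusing the Adam step above; the weight-decay value of 0.01 is an illustrative assumption:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Adam update plus a separate, decoupled weight-decay term."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The decay term is NOT divided by sqrt(v_hat), unlike an L2 penalty
    # that is folded into the gradient before the adaptive scaling.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```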
Summary of Optimizers
| Optimizer | Mechanism | Key Advantage | Typical Use Case |
|---|---|---|---|
| SGD | Gradient step | Simple, generalizes well | Baseline, fine-tuning |
| SGD + Momentum | Running avg of gradients | Escapes local minima, less oscillation | Computer Vision (ResNets) |
| RMSProp | Running avg of squared gradients | Adaptive per-parameter learning rates | RNNs, LSTMs |
| Adam | Combines Momentum & RMSProp | Fast convergence, requires less tuning | Default for most DL tasks |
| AdamW | Adam + decoupled weight decay | Better generalization than standard Adam | Transformers, LLMs |