Adam: A Method for Stochastic Optimization
Adam (Adaptive Moment Estimation) is an optimization algorithm that adaptively adjusts learning rates per parameter by combining exponential moving averages of gradients (first moment) and squared gradients (second moment), with bias correction to achieve efficient stochastic optimization.
Paper: Kingma & Ba, 2015, *Adam: A Method for Stochastic Optimization* (arXiv:1412.6980)
Key Definitions
| Term | Definition |
|---|---|
| First-Order Gradients | The first derivatives of a function with respect to its parameters. In optimization and machine learning, the first-order gradient of a loss function with respect to the parameter vector is the vector of partial derivatives with respect to each parameter |
| First Moment of the Gradients | The mean (expected value) of the gradients. In the Adam algorithm, it is estimated with an exponential moving average of the gradients |
| Second Moment of the Gradients | The uncentered variance of the gradients, i.e. the expected value of the squared gradients. In the Adam algorithm, it is estimated with an exponential moving average of the squared gradients |
| Step Size Annealing | Also known as learning rate decay: the step size (learning rate) used in gradient-based optimization is gradually reduced on a pre-defined schedule as training progresses. Taking smaller steps as the algorithm approaches an optimal solution reduces the risk of overshooting and helps it settle into a local or global minimum |
Introduction
- Adam (Adaptive Moment Estimation) is an optimization algorithm that helps machine learning models learn faster and more reliably
- Key Advantages
- Adapts learning rates automatically for each parameter
- Works well with noisy data and sparse gradients
- Requires minimal tuning for good results
- Memory efficient: needs only first-order gradients plus two moment vectors the same size as the parameters
- Analogy: When finding the lowest point in a foggy valley, instead of taking random steps, you would
- Remember which directions seemed to go downhill recently (first moment)
- Remember how consistent those directions were (second moment)
- Take bigger steps when you're confident and smaller steps when uncertain
Algorithm
Setup (Initialization)
- Before starting, Adam requires the following hyperparameters:
- α (alpha): learning rate, typically 0.001; controls the overall step size
- β₁ (beta1): usually 0.9; how much to trust recent gradients
- β₂ (beta2): usually 0.999; how much to trust recent gradient magnitudes
- ε (epsilon): tiny number like 10⁻⁸; prevents division by zero
- Initialize gradient statistics:
- m₀ = 0 and v₀ = 0: the first- and second-moment vectors, which will store gradient statistics
- t: iteration counter, starts at 0
The Update Loop
- At each training step, Adam performs the following operations:
- Calculate the gradient: compute gₜ = ∇θ fₜ(θₜ₋₁), which determines which direction the parameters should move
- Update momentum (first moment): mₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ
- This creates a moving average of recent gradients, smoothing out noise
- Update velocity (second moment): vₜ = β₂·vₜ₋₁ + (1 − β₂)·gₜ²
- This tracks how much gradients have been varying; higher variation means less certainty
- Correct for initialization bias: m̂ₜ = mₜ / (1 − β₁ᵗ) and v̂ₜ = vₜ / (1 − β₂ᵗ)
- Since the moments started at zero, early estimates are biased toward zero. These corrections fix that
- Update parameters: θₜ = θₜ₋₁ − α·m̂ₜ / (√v̂ₜ + ε)
- Move parameters in the direction of m̂ₜ, but scale by how certain we are (based on v̂ₜ)
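The loop above can be traced for a single iteration in NumPy. This is a minimal sketch under an assumed toy objective f(θ) = ‖θ‖² (so the gradient is 2θ); the variable names mirror the steps:

```python
import numpy as np

# Hyperparameters from the Setup section (defaults)
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

theta = np.array([1.0, -2.0])   # current parameters
m = np.zeros_like(theta)        # first moment, starts at zero
v = np.zeros_like(theta)        # second moment, starts at zero
t = 1                           # first iteration

g = 2 * theta                           # gradient of the toy objective ||theta||^2
m = beta1 * m + (1 - beta1) * g         # momentum update
v = beta2 * v + (1 - beta2) * g**2      # velocity update
m_hat = m / (1 - beta1**t)              # bias-corrected first moment
v_hat = v / (1 - beta2**t)              # bias-corrected second moment
theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter step

print(theta)  # approximately [0.999, -1.999]: each coordinate moved by about alpha
```

Notice that at t = 1 the corrected step is roughly α·sign(gₜ) for every coordinate, regardless of the gradient's magnitude.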
When you initialize the moment estimates at zero, they start artificially low. Without correction:
- Early updates would be too small
- The algorithm would learn slowly at first
- Performance would suffer unnecessarily
The correction denominators (1 − β₁ᵗ) and (1 − β₂ᵗ) grow from near zero toward 1, compensating for this initial bias. After many iterations, these terms approach 1 and have minimal effect.
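The effect is easy to check numerically. In this sketch, an assumed stream of constant gradients (all 1.0) feeds the first moment; the raw average m starts far below the true mean, while the corrected m̂ recovers it immediately:

```python
beta1 = 0.9
g = 1.0   # assume every incoming gradient is exactly 1.0
m = 0.0   # first moment initialized at zero

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g   # raw exponential moving average
    m_hat = m / (1 - beta1**t)        # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))

# t=1: m=0.1,   m_hat=1.0
# t=2: m=0.19,  m_hat=1.0
# t=3: m=0.271, m_hat=1.0
```

With a constant gradient the corrected estimate is exact at every step, while the raw average would take dozens of iterations to approach the true mean.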
Demo: Adam vs SGD playground
Step through a 1D optimization task and compare Adam to vanilla SGD on the same noisy gradients. Toggle bias correction to see how the step sizes change early in training.
Use the controls to change the learning rates, beta values, and gradient noise.
What to observe:
- The target marker (vertical line) shows the approximate minimum of the function f(x) = 0.2x² + 0.6 sin(1.5x). This is the ideal x value we're trying to reach.
- Watch the x position for Adam vs SGD (the colored dots on the track)
- Compare the step values in each panel: Adam adapts step sizes based on gradient history (via m̂/v̂), while SGD uses a fixed learning rate times the noisy gradient.
- Check Adam's m̂ and v̂ readouts: these show the bias-corrected moments that determine Adam's adaptive step size. Early in training (low t), bias correction makes m̂ and v̂ larger, allowing bigger initial steps.
- With noisy gradients, Adam typically converges more smoothly and reaches the minimum faster than SGD, thanks to its adaptive step sizing.
Key Properties
- Adaptive Step Sizes
- Each parameter gets its own effective learning rate based on its gradient history
- Parameters with:
- Large, consistent gradients → smaller steps (to avoid overshooting)
- Small or noisy gradients → carefully scaled steps
- Natural Annealing
- The effective step size is approximately: Δₜ ≈ α · m̂ₜ / √v̂ₜ
- The ratio m̂ₜ / √v̂ₜ acts like a signal-to-noise ratio (SNR):
- High SNR → confident gradient estimate → larger steps
- Low SNR → uncertain gradient estimate → smaller steps
- Near the optimum, gradients get smaller and noisier (low SNR), so Adam automatically takes smaller steps—this is "automatic annealing"
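This signal-to-noise behavior can be simulated with synthetic gradient streams. Everything here is illustrative: `adam_step_size` is a hypothetical helper, and the gradient statistics are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def adam_step_size(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Magnitude of the Adam step after processing a gradient sequence
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return alpha * abs(m_hat) / (np.sqrt(v_hat) + eps)

# High SNR: consistent gradients near 1.0 with tiny noise -> step close to alpha
high_snr = 1.0 + 0.01 * rng.standard_normal(200)
# Low SNR: near-zero mean swamped by noise -> much smaller step
low_snr = 0.01 + 1.0 * rng.standard_normal(200)

print(adam_step_size(high_snr), adam_step_size(low_snr))
```

The high-SNR stream produces a step close to α, while the noisy low-SNR stream (the situation near an optimum) yields a much smaller step, which is the "automatic annealing" described above.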
- Scale Invariance
- If you multiply all gradients by a constant c:
- The first moment m̂ₜ scales by c
- The second moment v̂ₜ scales by c²
- The ratio m̂ₜ / √v̂ₜ stays the same
- This means Adam behaves consistently regardless of how you scale your loss function or features
- Bounded Step Sizes
- The maximum step size is approximately bounded: |Δₜ| ⪅ α·(1 − β₁)/√(1 − β₂) in the case (1 − β₁) > √(1 − β₂), and |Δₜ| ⪅ α otherwise
- With the default settings (β₁ = 0.9, β₂ = 0.999), the first bound works out to roughly 3.16·α, and in the common case |m̂ₜ/√v̂ₜ| ≤ 1, so steps stay near or below the learning rate α
- This bound provides stability—steps can't explode unexpectedly
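Both scale invariance and the step-size bound can be checked numerically. In this sketch, `adam_update` is a hypothetical helper and the gradient values are made up; scaling every gradient by c = 100 leaves the step essentially unchanged:

```python
import numpy as np

def adam_update(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam step computed after a short sequence of gradients
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return alpha * m_hat / (np.sqrt(v_hat) + eps)

grads = [0.5, 0.4, 0.6]
step = adam_update(grads)
step_scaled = adam_update([100.0 * g for g in grads])  # all gradients scaled by c = 100

print(step, step_scaled)  # nearly identical; epsilon breaks exactness only negligibly
```

The only reason the two steps differ at all is the ε in the denominator, whose relative effect shrinks as gradients grow; and for this consistent gradient sequence the step magnitude stays close to α, as the bound predicts.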
Convergence Guarantees
- Adam comes with theoretical guarantees in the online learning framework
- Regret measures how much worse Adam performs compared to the best fixed parameters θ* in hindsight: R(T) = Σₜ₌₁ᵀ [fₜ(θₜ) − fₜ(θ*)]
- Theorem: Adam achieves O(√T) regret, meaning:
- R(T)/T = O(1/√T) → 0 as T → ∞
- Over time, Adam's average performance approaches the best possible fixed strategy. This is as good as the best-known optimization methods in this setting
Comparison with Related Methods
- AdaGrad: Adapts learning rates but accumulates all past gradients, causing learning rates to shrink too aggressively over time
- RMSProp: Like Adam but without momentum (it only uses the second moment) and without bias correction. Works well in practice but often converges more slowly
- Adam = RMSProp + Momentum + Bias Correction: Combines the best of both worlds with proper initialization handling
Code
```python
import numpy as np

def adam_optimizer(f, grad, x0, learning_rate=0.001, beta1=0.9, beta2=0.999,
                   epsilon=1e-8, num_iterations=10):
    # f (the objective) is kept for the interface but unused: Adam only needs gradients
    x = x0                 # parameter vector, shape (N,)
    m = np.zeros_like(x)   # first moment: EMA of past gradients
    v = np.zeros_like(x)   # second moment: EMA of past squared gradients
    for t in range(1, num_iterations + 1):
        g = grad(x)                          # gradient at current parameters
        m = beta1 * m + (1 - beta1) * g      # update biased first moment
        v = beta2 * v + (1 - beta2) * g**2   # update biased second moment
        m_hat = m / (1 - beta1**t)           # bias-corrected first moment
        v_hat = v / (1 - beta2**t)           # bias-corrected second moment
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)  # parameter step
    return x
```