Swapping one line of code pushed our digit classifier from 93.6% to 97.4% accuracy. Same model, same data, same training loop. The only change was the optimizer.
That gap is worth understanding. Optimization is the part of deep learning most practitioners treat as a black box - pick Adam, tune the learning rate, ship it. But the optimizer is doing something profound: it's navigating a loss landscape with billions of dimensions, where the wrong strategy gets you stuck in a ravine while the right one finds paths the gradient alone would never reveal.
Why Vanilla SGD Fails
Stochastic Gradient Descent computes the gradient on a small batch and steps in the direction of steepest descent:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)$$
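In code, the whole update is a couple of lines. A minimal sketch with a stand-in linear model and random batch (not the article's MNIST classifier):

```python
import torch
import torch.nn as nn

# Minimal sketch of one vanilla SGD step on a stand-in model and batch.
model = nn.Linear(10, 2)
inputs, targets = torch.randn(32, 10), torch.randint(0, 2, (32,))
lr = 0.01

loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()                      # gradient of the loss on this mini-batch
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad             # step in the direction of steepest descent
        p.grad = None                # clear for the next step
```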
The learning rate $\eta$ controls step size. This is where the trouble starts.
Set it too large and you overshoot the minimum, bouncing chaotically. Set it too small and training crawls. The right $\eta$ at step 1 (large loss, rough landscape) is different from the right $\eta$ at step 10,000 (near a minimum, sensitive terrain). A single global value can't be right for both.
The second problem is geometry. Real loss surfaces have narrow ravines: steep walls in one direction, a shallow slope in the direction you want to move. SGD takes big steps across the ravine and tiny steps along it, oscillating wildly instead of making progress.
The third problem is more subtle. Not all parameters are equally well-trained. An embedding for a rare word updates rarely - when it does, it needs a big step. Common-word embeddings update constantly and need smaller ones. SGD applies the same $\eta$ to every parameter, regardless of how often or how far it has moved.
These three failures - the learning rate dilemma, geometric inefficiency, and uniform treatment of parameters - all have different solutions. Momentum addresses the geometry. Adam addresses all three at once.
Momentum: Giving the Optimizer a Memory
The idea: instead of reacting only to the current gradient, accumulate a velocity that builds over time.

$$v_t = \beta \, v_{t-1} + \nabla_\theta \mathcal{L}(\theta_{t-1}) \qquad\qquad \theta_t = \theta_{t-1} - \eta \, v_t$$
With $\beta = 0.9$, 90% of the previous velocity carries forward. In a ravine, the cross-ravine gradients alternate in sign at each step and cancel out in the accumulated velocity. The along-ravine gradients consistently point the same direction and build up. The optimizer rolls smoothly down the valley instead of bouncing off the walls.
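A minimal sketch of the same idea in code, again on a stand-in model: one velocity buffer per parameter, decayed by $\beta$ and refreshed with the new gradient.

```python
import torch
import torch.nn as nn

# Sketch of SGD with momentum: one velocity buffer per parameter.
model = nn.Linear(10, 2)
inputs, targets = torch.randn(32, 10), torch.randint(0, 2, (32,))
lr, beta = 0.01, 0.9

velocity = [torch.zeros_like(p) for p in model.parameters()]

loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
with torch.no_grad():
    for p, v in zip(model.parameters(), velocity):
        v.mul_(beta).add_(p.grad)    # v = beta * v + gradient
        p -= lr * v                  # step along the accumulated velocity
        p.grad = None

# The built-in equivalent:
# torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```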
This also helps with saddle points. A saddle point has zero gradient, which brings vanilla SGD to a halt. But an optimizer with accumulated velocity can coast through small flat regions, like a ball rolling over a gentle bump instead of stopping at it.
Momentum was first proposed by Polyak in 1964 and later shown by Sutskever et al. (2013) to be critical for training deep networks. It solved the geometry problem, but it still applied the same learning rate to every parameter. That required a different fix.
Adam: One Optimizer to Rule Them All
Adam (Kingma & Ba, 2015) adds a second running average: the squared gradient. This gives it per-parameter scaling.
It maintains two moments:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \qquad\qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

where $g_t$ is the mini-batch gradient at step $t$.
$m_t$ is the exponential moving average of gradients - the momentum term. $v_t$ is the exponential moving average of squared gradients. It tracks how large the gradients have been for each parameter over time.
Because both start at zero, the early estimates are biased toward zero. Bias correction fixes this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad\qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Then the update:

$$\theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
The division by $\sqrt{\hat{v}_t}$ is the payoff. A parameter that has consistently received large gradients has a large $\hat{v}_t$, so it gets a smaller effective step. A parameter that has rarely updated has a small $\hat{v}_t$, so it gets a larger step. The optimizer automatically calibrates step sizes to each parameter's history.
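A stripped-down Adam step, written out by hand on a stand-in model so the two moments and the bias correction are visible (in practice you would just use `torch.optim.Adam`):

```python
import torch
import torch.nn as nn

# Hand-written Adam step on a stand-in model and batch.
model = nn.Linear(10, 2)
inputs, targets = torch.randn(32, 10), torch.randint(0, 2, (32,))
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

m = [torch.zeros_like(p) for p in model.parameters()]   # first moment
v = [torch.zeros_like(p) for p in model.parameters()]   # second moment
t = 0

loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
t += 1
with torch.no_grad():
    for p, m_i, v_i in zip(model.parameters(), m, v):
        g = p.grad
        m_i.mul_(beta1).add_(g, alpha=1 - beta1)           # m = b1*m + (1-b1)*g
        v_i.mul_(beta2).addcmul_(g, g, value=1 - beta2)    # v = b2*v + (1-b2)*g^2
        m_hat = m_i / (1 - beta1 ** t)                     # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        p -= lr * m_hat / (v_hat.sqrt() + eps)             # per-parameter step size
        p.grad = None
```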
Default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) work well across an enormous range of tasks. This is rare in deep learning, and it's why Adam became the default optimizer almost everywhere.
One line changes everything
Remember our digit classifier? Let's swap the optimizer:
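A sketch of what that swap looks like, assuming the training script builds its optimizer from `torch.optim` (the stand-in model below is not the original MNIST network):

```python
import torch
import torch.nn as nn

# Stand-in for the digit classifier; the only change that matters is the
# optimizer line.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# Before:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# After - the one-line swap:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```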
Same model, same data, same training loop. Just a different optimizer. The result: test accuracy climbs from 93.6% with SGD to 97.4% with Adam.
Adam converges faster (lower loss) and generalizes better (higher test accuracy). The adaptive learning rates let it make larger updates for undertrained parameters and smaller updates for well-trained ones.
Seeing the difference
Watch all three optimizers race to the minimum on the same loss landscape: SGD crawls and oscillates, Momentum builds speed but overshoots, Adam finds the shortest path.
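The interactive demo isn't reproduced here, but a rough static version is easy to sketch: a toy 2D ravine, the three optimizers, and their trajectories (the learning rates below are illustrative, not tuned):

```python
import torch
import matplotlib.pyplot as plt

# Toy 2D "ravine" loss: steep in x, shallow in y.
def loss_fn(p):
    return 10 * p[0] ** 2 + 0.1 * p[1] ** 2

def run(make_opt, steps=100):
    p = torch.tensor([1.5, 2.5], requires_grad=True)
    opt = make_opt([p])
    path = [p.detach().clone()]
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(p).backward()
        opt.step()
        path.append(p.detach().clone())
    return torch.stack(path)

paths = {
    "SGD": run(lambda ps: torch.optim.SGD(ps, lr=0.09)),
    "Momentum": run(lambda ps: torch.optim.SGD(ps, lr=0.09, momentum=0.9)),
    "Adam": run(lambda ps: torch.optim.Adam(ps, lr=0.3)),
}
for name, path in paths.items():
    plt.plot(path[:, 0], path[:, 1], marker=".", label=name)
plt.xlabel("steep direction")
plt.ylabel("shallow direction")
plt.legend()
plt.show()
```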
AdamW: Fixing a Subtle Bug
Adam's success created a hidden problem. Standard L2 regularization adds $\frac{\lambda}{2}\lVert\theta\rVert^2$ to the loss, which adds $\lambda\theta$ to the gradient. Adam then scales this gradient term by $\frac{1}{\sqrt{\hat{v}_t} + \epsilon}$ along with everything else - which means the effective regularization strength varies per parameter depending on gradient history. Parameters with historically large gradients get weaker regularization than parameters with small gradients. This breaks the intended behavior of weight decay entirely.
Loshchilov & Hutter (2019) fixed this by decoupling weight decay from the gradient update:

$$\theta_t = (1 - \eta\lambda)\,\theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
The $(1 - \eta\lambda)$ factor shrinks weights directly, before the adaptive step. Weight decay now applies uniformly, independent of gradient history. This is AdamW, and the seemingly small change improved generalization noticeably - enough that it became the standard optimizer for Transformer training. GPT, BERT, and LLaMA all use it.
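In PyTorch the distinction is just a choice of optimizer class; the `weight_decay` value below is illustrative:

```python
import torch
import torch.nn as nn

# Two ways to regularize the same stand-in model.
model = nn.Linear(10, 2)

# Adam + L2: the decay term is added to the gradient, so it gets rescaled
# per parameter by the adaptive step.
adam_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# AdamW: decoupled decay shrinks the weights directly, independent of
# gradient history.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```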
Loss Functions: What Is the Optimizer Minimizing?
The optimizer navigates a loss landscape. The loss function defines that landscape. Choose the wrong one and you're optimizing for the wrong thing, no matter how good your optimizer is.
Cross-Entropy ($\mathcal{L} = -\log p_{\text{correct}}$) is the standard for classification. It barely penalizes confident correct predictions but explodes when the model is confidently wrong. At $p_{\text{correct}} = 0.01$ (very wrong), the loss is 4.6 - compared to just 0.98 for MSE. This harsh penalty for overconfident mistakes is exactly what you want when training a language model: a model that assigns probability 0.01 to the correct next token should suffer severely.
Mean Squared Error ($\mathcal{L} = (y - \hat{y})^2$) is gentler. It penalizes large errors more than small ones, but not as aggressively as cross-entropy. It's standard for regression tasks where you're predicting a continuous value.
Mean Absolute Error ($\mathcal{L} = |y - \hat{y}|$) has a constant gradient regardless of error magnitude. It's robust to outliers - a single terrible prediction doesn't dominate the gradient - but converges more slowly because it doesn't accelerate near the optimum.
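To make the comparison concrete, here are the three penalties at the same confidently wrong prediction - probability 0.01 assigned to the true class, matching the numbers above:

```python
import torch

# The numbers behind the comparison: the model puts 0.01 on the true class.
p_correct = torch.tensor(0.01)
target = torch.tensor(1.0)

cross_entropy = -torch.log(p_correct)     # ~4.61: explodes when confidently wrong
mse = (target - p_correct) ** 2           # ~0.98: quadratic, gentler
mae = (target - p_correct).abs()          # ~0.99: linear, constant gradient
print(cross_entropy.item(), mse.item(), mae.item())
```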
The interaction between loss function and optimizer matters. Cross-entropy's steep gradient for wrong predictions pairs well with Adam's adaptive scaling: parameters driving confident mistakes get large, targeted updates. MSE's gentler gradient is a better fit for SGD on smooth regression landscapes.
Learning Rate Schedules
Even with Adam and a well-chosen loss function, a constant learning rate is leaving performance on the table. Early in training, the loss landscape is rough and gradients are noisy - you want large steps to explore. Late in training, you're near a minimum and need precise, small steps to settle in. One learning rate cannot be optimal for both phases.
Warmup entered Transformer training with Vaswani et al. (2017), who paired it with an inverse-square-root decay; modern LLM recipes pair the warmup with cosine decay instead:
- Linear warmup (first ~1000 steps): the learning rate ramps from 0 to the peak. The model starts with randomly initialized weights - if you take large steps immediately, the first few batches cause destructive updates that the rest of training has to undo. Warmup lets the optimizer stabilize before taking full-sized steps.
- Cosine decay: the learning rate follows a cosine curve from peak to near-zero. Smooth and gradual - no sudden drops that could kick the parameters out of the minimum they've settled into.
Other schedules exist - step decay, exponential decay, cyclical learning rates (Smith, 2017) - but warmup + cosine is the de facto standard for large language models.
Adding a scheduler to our MNIST training:
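A sketch of how that might look, using `LambdaLR` to combine a linear ramp with a cosine curve (the stand-in model, step counts, and peak learning rate below are placeholders for the original script's values):

```python
import math
import torch
import torch.nn as nn

# Warmup + cosine decay wired up with LambdaLR.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # lr here is the peak

warmup_steps, total_steps = 1_000, 10_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                            # linear ramp: 0 -> peak
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))           # cosine: peak -> ~0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step() go here ...
    scheduler.step()                                          # once per optimizer step
```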
What LLM Training Actually Uses
The stack described above - AdamW plus warmup plus cosine decay - is exactly what large language models use, with a few additions (a code sketch of the full recipe follows the list):
- AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$ (slightly lower than Adam's default 0.999; the smaller value makes the second moment estimate respond faster, which helps at large scale where gradient statistics shift)
- Warmup + cosine decay, with a peak learning rate of a few times $10^{-4}$ for smaller models and closer to $10^{-4}$ for larger ones
- Gradient clipping at norm 1.0 - this caps the gradient magnitude before the optimizer step, preventing the occasional enormous gradient spike from destroying the weight values trained over thousands of steps
- Mixed precision (fp16/bf16 for the forward and backward passes, fp32 for the optimizer states) roughly halves activation memory and can double throughput; the optimizer states stay in fp32 because that's where numerical precision actually matters
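Put together, a single training step in that style might look like the sketch below. The stand-in model, loss, and hyperparameter values mirror commonly published configurations rather than any specific model's recipe:

```python
import torch
import torch.nn as nn

# Sketch of an LLM-style training step: AdamW with lowered beta2, decoupled
# weight decay, gradient clipping at norm 1.0, and bf16 mixed precision.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)

inputs = torch.randn(8, 1024, device=device)
targets = torch.randn(8, 1024, device=device)

optimizer.zero_grad(set_to_none=True)
# bf16 for the forward/backward compute; AdamW keeps its states in fp32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()
# Cap the global gradient norm at 1.0 before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```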
Recent alternatives challenge Adam's dominance on specific fronts. Lion (Chen et al., 2023) uses only the sign of the update rather than its magnitude, reducing memory by not storing a second moment. It's competitive with Adam at lower cost. Sophia (Liu et al., 2023) uses a diagonal Hessian estimate to get more precise curvature information and claims 2x faster convergence on LLM pre-training. LAMB (You et al., 2020) scales each layer's update by the ratio of its weight norm to its update norm, enabling much larger batch sizes (up to 64K) without accuracy degradation.
None have consistently unseated AdamW. It's well-understood, well-tuned across a decade of use, and any alternative has to clear a high bar to beat something that already just works.
The techniques above are responsible for getting parameters to good values. Keeping them there - and preventing the network from collapsing, exploding, or simply memorizing the training set - is a different problem. That's what regularization and normalization are for. Continue to Regularization & Stability.
