A model can memorize 60,000 MNIST digits and still fail on the 61st. In the previous posts, we hit 97.8% accuracy on MNIST's clean test set. But that number is deceptively optimistic. Train on a small subset, throw messier data at it, or add a few more layers, and the whole thing unravels. Understanding why - and how to fix it - is what separates models that work in demos from ones that survive contact with reality.
We'll keep evolving the same DigitClassifier, adding dropout, normalization, and residual connections one at a time and measuring what each actually buys us.
The Memorization Problem
When a model performs perfectly on training data but fails on new data, it has overfit: it memorized the noise and peculiarities of the training set instead of learning the underlying pattern.
This is not a theoretical concern. A model with more parameters than training examples can literally memorize every example - just learn to output the correct label whenever it sees that specific noise pattern. GPT-3 has 175 billion parameters, and modern LLMs are larger still; even with trillions of training tokens, overfitting is a constant pressure.
The entire field of regularization exists to answer one question: how do you force a model to learn general patterns when memorizing specific examples is the path of least resistance?
Weight Decay (L2 Regularization)
The simplest answer: make memorization expensive. Add a penalty for large weights directly to the loss:

$$L_{\text{total}} = L_{\text{task}} + \lambda \sum_i w_i^2$$
The $\lambda \sum_i w_i^2$ term pushes all weights toward zero during every update. Large weights that encode specific memorized patterns get penalized; solutions with small, distributed weights get rewarded. The model is forced toward simpler hypotheses.
L1 regularization ($\lambda \sum_i |w_i|$) takes this further. While L2 makes weights small, L1 makes weights exactly zero, effectively pruning connections from the network. L1 produces sparse models; L2 produces smooth ones.
In practice, Transformers use decoupled weight decay rather than an L2 term in the loss. The two are equivalent under plain SGD but diverge under adaptive optimizers. AdamW (from the previous post) implements this decoupling: the decay is applied directly to the weights rather than routed through the gradient, which is the right behavior when every parameter has its own effective learning rate.
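In PyTorch this is a one-line change to the optimizer. A minimal sketch - the model here is a stand-in for our DigitClassifier, and 0.01 is a common default rather than a tuned value:

```python
import torch
import torch.nn as nn

# Stand-in model, just to make the snippet self-contained.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# AdamW decays weights directly in the update step (decoupled), instead of
# adding a lambda * ||w||^2 term to the loss and letting Adam rescale it.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```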
Typical values: $10^{-4}$ to $10^{-1}$, with $0.01$ a common default. Too much and the model underfits. Too little and it memorizes. But weight decay alone is not enough - it shrinks weights but does not prevent the network from over-specializing its neurons. For that, we need something more aggressive.
Dropout - Forcing Redundancy
Dropout (Srivastava et al., 2014) is deceptively simple: during each training step, randomly set a fraction of neuron outputs to zero.
Typical dropout rate: 10-20% for Transformers, up to 50% for smaller models.
Why does randomly crippling the network while it is learning make it better? Three reasons:
Forces redundancy. If neuron A might disappear on any given step, the network cannot rely on it alone to carry critical information. The same knowledge gets encoded redundantly across many neurons - exactly what we want for generalization.
Implicit ensemble. Each training step uses a different random subset of neurons, effectively training a different subnetwork. The final model averages exponentially many such subnetworks. Ensembles almost always generalize better than individual models.
Breaks co-adaptation. Without dropout, neuron B can learn to only activate when neuron A fires first, creating brittle dependencies. Dropout prevents these fragile partnerships.
During inference, dropout is turned off. All neurons are active, and in the original formulation their outputs are scaled by $1-p$ (the keep probability) to compensate for having more active neurons than during training.
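Here is a minimal sketch of the inverted variant, which is what PyTorch's nn.Dropout actually implements - the rescaling moves to training time, so inference needs no adjustment at all:

```python
import torch

def dropout(x: torch.Tensor, p: float = 0.2, training: bool = True) -> torch.Tensor:
    """Inverted dropout: zero each element with probability p during training,
    rescaling survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x  # inference: all neurons active, nothing to compensate for
    mask = (torch.rand_like(x) >= p).float()
    return x * mask / (1.0 - p)
```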
To see the effect clearly, we need to overfit first. Training on only 1,000 examples makes this easy:
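A sketch of the setup, assuming the standard torchvision MNIST loader (the series' own data pipeline may differ slightly):

```python
from torch.utils.data import Subset
from torchvision import datasets, transforms

mnist_train = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
small_train = Subset(mnist_train, range(1000))  # 1,000 of the 60,000 examples
```

With so few examples, an unregularized network drives training accuracy to 100% while test accuracy stalls - exactly the gap dropout should close.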
In Transformers, dropout appears in two specific places: after the attention weights (before multiplying by V), and after each sub-layer output (attention and FFN), right before the residual addition.
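Sketched as code - the helper and names here are illustrative, not the series' implementation:

```python
import torch
import torch.nn as nn

def attention_sublayer(q, k, v, x, attn_drop: nn.Dropout, resid_drop: nn.Dropout):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    weights = attn_drop(weights)   # (1) dropout on attention weights, before @ V
    out = weights @ v
    return x + resid_drop(out)     # (2) dropout on the output, before the residual add
```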
Dropout and weight decay together address overfitting. But there is a second, deeper problem with deep networks that has nothing to do with generalization. It has to do with whether you can train them at all.
The Vanishing Gradient Problem
Backpropagation works by multiplying gradients layer by layer. Each layer in the chain contributes a factor:

$$\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_N} \prod_{i=1}^{N} \frac{\partial x_i}{\partial x_{i-1}}$$
If those factors are consistently less than 1 - which they are with sigmoid or tanh activations (the sigmoid's derivative never exceeds $0.25$) - the product decays exponentially toward zero. After 20 layers: $0.25^{20} \approx 10^{-12}$. After 50 layers: $0.25^{50} \approx 10^{-30}$. After 100 layers: effectively nothing. Early layers stop receiving any gradient signal and stop learning.
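The decay is easy to verify directly:

```python
factor = 0.25  # the sigmoid's maximum derivative
for depth in (20, 50, 100):
    print(f"{depth} layers: {factor ** depth:.1e}")
# 20 layers: 9.1e-13
# 50 layers: 7.9e-31
# 100 layers: 6.2e-61
```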
This is why deep networks were considered impractical for most of the 2000s. It is also what killed vanilla RNNs for long sequences: processing 500 tokens is equivalent to passing gradients backward through 500 layers of multiplication.
The exploding gradient problem is the mirror image: when those factors are greater than 1, gradients grow exponentially, causing numerical overflow. Gradient clipping - capping the gradient norm at a threshold, typically 1.0 - is the standard fix, and it is used in virtually every Transformer training run.
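In PyTorch, clipping is a single call between the backward pass and the optimizer step (the model here is a stand-in):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)  # stand-in for any network
loss = model(torch.randn(8, 784)).sum()
loss.backward()

# Rescale the global gradient norm down to 1.0 if it exceeds the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# ...then optimizer.step() as usual.
```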
Clipping solves explosions. But vanishing gradients require a more fundamental architectural change.
Residual Connections - The Gradient Highway
He et al., 2015 proposed the fix that enabled truly deep networks: skip connections.
Instead of learning the full output $H(x)$ directly, the layer learns a residual $F(x) = H(x) - x$ - the difference between the desired output and the input - and the block computes $y = x + F(x)$. The identity shortcut creates a direct path for gradients to flow backward, bypassing the layer entirely.
The gradient through a residual block is:

$$\frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}$$
That $I$ (identity matrix) is the key. Even if $\frac{\partial F}{\partial x} \approx 0$, the gradient still passes through unchanged. Vanishing gradients become impossible by construction.
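In code, the highway is a single addition. A minimal sketch, with a one-layer $F(x)$ standing in for whatever the block computes:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the skip connection is just an addition."""
    def __init__(self, dim: int):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Even if self.layer's gradient collapses, the 'x +' path contributes I.
        return x + torch.relu(self.layer(x))
```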
This one idea enabled:
- ResNet (152 layers, 2015) - won ImageNet with 3.57% top-5 error
- Transformers (6-96+ layers) - every sub-layer uses residual connections
- Modern LLMs (up to 128 layers in some architectures)
Normalization - Preventing Activation Drift
Residual connections solve gradient flow. But even with them, there is a subtler instability: activations can drift to extreme values as they pass through many layers. Small biases accumulate. Layers deeper in the network receive inputs with wildly different statistical properties than shallower layers, making gradient steps inconsistent and training unstable.
The solution is to normalize activations at each layer boundary.
Batch Normalization
Batch Norm (Ioffe & Szegedy, 2015) normalizes across the batch dimension: for each feature, compute mean and variance across all examples in the current batch, then normalize.
Then apply learned scale ($\gamma$) and shift ($\beta$) parameters so the network can undo the normalization if needed:

$$y = \gamma \cdot \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}} + \beta$$
Batch Norm was revolutionary for CNNs but breaks down for sequences: the statistics depend on the batch, which means behavior differs between training and inference. It also requires large batch sizes to estimate stable statistics.
Layer Normalization
Layer Norm (Ba et al., 2016) normalizes across the feature dimension instead: for each individual example, compute mean and variance across all its features.
This is batch-size independent. A batch of 1 or 1,000 examples gets normalized the same way. During autoregressive generation, where you process exactly one token at a time, this property is non-negotiable. Every Transformer block uses Layer Norm.
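The only difference between the two is which axis the statistics run over:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64)   # 32 examples, 64 features

bn = nn.BatchNorm1d(64)   # per-feature stats, computed across the 32 examples
ln = nn.LayerNorm(64)     # per-example stats, computed across the 64 features

print(bn(x).mean(dim=0))  # ~0 for each feature - depends on who else is in the batch
print(ln(x).mean(dim=1))  # ~0 for each example - batch-size independent
```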
The original paper applied it after each sub-layer ("Post-LN"): $x \leftarrow \text{LayerNorm}(x + \text{Sublayer}(x))$. Modern implementations typically use "Pre-LN": $x \leftarrow x + \text{Sublayer}(\text{LayerNorm}(x))$. Pre-LN is more stable for very deep networks because the normalization happens before the transformation, not after the residual addition.
RMSNorm
RMSNorm (Zhang & Sennrich, 2019) drops the mean-centering step entirely, normalizing only by the root mean square:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$$
Cheaper to compute (no mean subtraction), and empirically just as effective. Llama 2 and most modern LLMs use RMSNorm in place of full Layer Norm.
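A sketch matching the formula above (production implementations like Llama's also upcast to float32 for the normalization, omitted here):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learned scale (gamma)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight  # no mean subtraction, no shift
```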
Putting the Pattern Together
Each technique so far solves a specific failure mode. The Transformer architecture wires them into a single repeating block that handles all of them simultaneously:

$$x \leftarrow x + \text{Dropout}(\text{Sublayer}(\text{LayerNorm}(x)))$$
LayerNorm stabilizes the input statistics before each transformation. The residual connection ensures gradients flow directly backward regardless of what the sublayer does. Dropout is applied inside the sublayer to regularize. And weight decay in the optimizer prevents any of the weights from growing large enough to dominate.
Let's build this pattern from scratch on our digit classifier - two residual blocks that mirror the structure of a Transformer FFN:
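A sketch of that structure - the widths (256-dimensional residual stream, 512-unit FFN) and the 0.1 dropout rate are illustrative defaults, not necessarily the series' exact values:

```python
import torch
import torch.nn as nn

class ResidualFFNBlock(nn.Module):
    """Pre-LN residual block: x + Dropout(FFN(LayerNorm(x)))."""
    def __init__(self, dim: int, hidden: int, p: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )
        self.dropout = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.dropout(self.ffn(self.norm(x)))  # identity shortcut

class DigitClassifier(nn.Module):
    def __init__(self, dim: int = 256, hidden: int = 512, p: float = 0.1):
        super().__init__()
        self.embed = nn.Linear(28 * 28, dim)
        self.blocks = nn.Sequential(
            ResidualFFNBlock(dim, hidden, p),
            ResidualFFNBlock(dim, hidden, p),
        )
        self.head = nn.Linear(dim, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.embed(x.flatten(1)))
        return self.head(self.blocks(x))
```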
This is structurally identical to what happens inside every Transformer block, where the pattern appears twice - once around attention, once around the FFN - because both sub-layers can drift and both benefit from regularization.
Here is what each technique is actually doing in production systems:
| Technique | What it prevents | How it works |
|---|---|---|
| Weight decay | Overfitting | Shrinks weights toward zero each step |
| Dropout | Overfitting, co-adaptation | Randomly disables neurons during training |
| Gradient clipping | Exploding gradients | Caps gradient norm at a threshold |
| Residual connections | Vanishing gradients | Adds skip connections around layers |
| Layer Norm / RMSNorm | Activation drift | Normalizes features per-example |
Every modern Transformer uses all five simultaneously. They are not optional extras - they are load-bearing infrastructure. Remove any one and training either overfits, diverges, or stalls.
How far we have come
| Step | What changed | Test accuracy |
|---|---|---|
| Basic network + SGD | Baseline | 93.6% |
| Swapped to Adam | Better optimizer | 97.4% |
| AdamW + weight decay | Decoupled regularization | 97.6% |
| + cosine LR schedule | Smoother convergence | 97.8% |
| + dropout + LayerNorm + residuals | This post | 98.1% |
From 93.6% to 98.1% without changing the network size or the data. Every technique in these posts contributed. The next question is: how do we apply all of this to language, where the input is a sequence of tokens rather than a fixed image? Continue to Language Modeling & Recurrent Networks.
