Deep Learning from First Principles

A 7-billion-parameter language model is, at its core, a single mathematical operation repeated millions of times: multiply a vector by a matrix, add a bias, apply a nonlinearity. That's it. The gap between that simple operation and a model that writes code, reasons about philosophy, and translates between languages is almost entirely a question of scale and what you train it on.

Understanding that gap - really understanding it, not just accepting it - requires building from the bottom. So that's what we'll do. The running example is a network that classifies handwritten digits. It's small enough to train on a laptop CPU in 30 seconds, and every concept we introduce is something you can watch affect the training curve.

Vectors

Everything in deep learning is a vector. Not figuratively - literally. An MNIST image is a vector. A word is a vector. A sentence is a sequence of vectors. The entire state of a neural network at inference time is a chain of vector transformations.

A vector is an ordered list of numbers. A single number (a scalar) describes one quantity - 72 degrees Fahrenheit. Most things in the world need more. A map position needs two numbers. A color needs three. A word's meaning in a language model needs 768 or more, each dimension capturing some aspect of how that word is used in context.

Direction and magnitude

Vectors have two properties that matter. Direction encodes what the vector represents - which combination of features it emphasizes. Magnitude encodes how strongly.

A vector $[1, 0]$ points purely along the first axis. $[0, 1]$ purely along the second. $[1, 1]$ points between them with magnitude $\sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.41$ . The magnitude formula extends to any dimension:

\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}

The dimensionality trick

The math works identically in any number of dimensions. A dot product between two 768-dimensional vectors follows the same formula as between two 2D vectors - just more terms to sum. This is not a special property of neural networks; it's linear algebra. But it has a profound consequence: we can represent meaning, syntax, and world knowledge as points in a very high-dimensional space, and all the geometric intuitions we have in 2D still apply.

There is a useful counterintuitive fact about high-dimensional spaces: almost all pairs of random vectors are nearly perpendicular. In 768 dimensions there is a vast amount of room for vectors to be distinct from each other. This is why high-dimensional embeddings can encode so much information without interfering.

python

1
2
3
4
5
6
7
8
9
10
11
import torch

# An MNIST image is a 784-dimensional vector
# Each dimension is a pixel brightness from 0.0 (black) to 1.0 (white)
pixel_values = torch.randn(784)
print(pixel_values.shape)  # torch.Size([784])
print(pixel_values[:5])    # first 5 pixel values

# A color is a 3-dimensional vector
orange = torch.tensor([255.0, 165.0, 0.0])
print(orange.shape)  # torch.Size([3])

The entire job of a neural network is to take one vector and transform it into another: pixels in, digit label out. But to transform a vector, you need an operation. That operation is the dot product.

Dot Products - Measuring Similarity

The most important operation

The dot product takes two vectors of the same length and produces a single number. Multiply corresponding elements, then sum:

\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n = \sum_{i=1}^n a_i b_i

For example: $[1, 2, 3] \cdot [4, 5, 6] = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32$

The geometric interpretation

The dot product has a geometric meaning that makes it indispensable:

\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \cdot \|\mathbf{b}\| \cdot \cos(\theta)

where $\theta$ is the angle between the two vectors. The result:

Same direction ( $\theta = 0°$ , $\cos = 1$ ): maximum positive value. The vectors agree completely.
Perpendicular ( $\theta = 90°$ , $\cos = 0$ ): zero. The vectors share no information.
Opposite directions ( $\theta = 180°$ , $\cos = -1$ ): maximum negative value. The vectors disagree.

This is why the dot product is the foundation of attention in Transformers: it measures how similar two vectors are. The query vector "asks a question" and the key vector "offers an answer." Their dot product measures how well the answer matches the question.

Cosine similarity

If you care about direction but not magnitude, normalize by dividing out the magnitudes:

\text{cosine similarity} = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \cdot \|\mathbf{b}\|} \in [-1, 1]

This is used extensively in NLP to compare word embeddings. "King" and "queen" have high cosine similarity because they point in similar directions in embedding space, even if they differ in magnitude.

Every neuron in a neural network computes a dot product. A neuron with weights $[w_1, w_2, w_3]$ receiving inputs $[x_1, x_2, x_3]$ computes $w_1 x_1 + w_2 x_2 + w_3 x_3$ . The neuron is measuring how similar the input is to its learned weight pattern. Inputs that match produce large outputs; inputs that don't match produce small ones.

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Dot product: 1*4 + 2*5 + 3*6 = 32
dot = torch.dot(a, b)
print(f"Dot product: {dot}")  # 32.0

# Cosine similarity (normalized dot product)
cos_sim = torch.nn.functional.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))
print(f"Cosine similarity: {cos_sim.item():.4f}")  # 0.9746 (very similar direction)

# Two perpendicular vectors
c = torch.tensor([1.0, 0.0])
d = torch.tensor([0.0, 1.0])
print(f"Perpendicular dot: {torch.dot(c, d)}")  # 0.0

A single dot product produces a single number: one neuron's response to one input. To run an entire layer of neurons at once, we need to compute many dot products simultaneously. That's matrix multiplication.

Matrix Multiplication

Many dot products at once

A matrix is a 2D grid of numbers. Matrix multiplication is the operation that makes neural networks computationally feasible: it computes many dot products in parallel.

When you multiply matrix $A$ (shape $m \times n$ ) by matrix $B$ (shape $n \times p$ ), each element of the result is the dot product of one row from $A$ with one column from $B$ . The result has shape $m \times p$ .

The dimension rule: $(m \times n) \times (n \times p) = (m \times p)$ . The inner dimension $n$ must match - that's the shared length of each dot product.

Every layer of a neural network is fundamentally a matrix multiplication. A layer with 784 inputs and 128 outputs has a weight matrix of shape $784 \times 128$ . Feed in one input vector and you compute 128 dot products simultaneously - one per output neuron.

The real power comes from batching. Stack 64 input vectors into a matrix of shape $64 \times 784$ , multiply by the weight matrix ( $784 \times 128$ ), and get a $64 \times 128$ result - all 64 inputs processed through all 128 neurons in a single operation. This is why GPUs, built for parallel matrix math, matter so much. A forward pass isn't a loop over examples; it's one matrix multiply.

python

1
2
3
4
5
6
7
8
9
10
11
12
13
# 3 input features, projected to 4 outputs
x = torch.tensor([[0.2, 0.8, -0.1]])  # shape: (1, 3)
W = torch.randn(3, 4)                  # shape: (3, 4)

output = x @ W  # matrix multiply, shape: (1, 4)
print(f"Input shape:  {x.shape}")       # (1, 3)
print(f"Weight shape: {W.shape}")       # (3, 4)
print(f"Output shape: {output.shape}")  # (1, 4)

# Batched: 64 inputs processed at once
batch = torch.randn(64, 3)    # 64 inputs, each 3-dimensional
output = batch @ W             # shape: (64, 4)
print(f"Batch output: {output.shape}")  # (64, 4) - all 64 done at once

Multiplying a vector by a matrix doesn't just scale it - it transforms it geometrically: rotating, projecting, or compressing it. A $784 \times 128$ matrix transforms 784-dimensional space into 128-dimensional space. Each layer of a neural network applies a different transformation. The network learns a sequence of transformations that, composed together, map pixel space to digit-label space.

To compose many layers efficiently, we need one more abstraction: tensors.

Tensors - Thinking Beyond 2D

The generalization of matrices

In real deep learning code you won't see "matrix" or "vector." You'll see tensor - a generalization to any number of dimensions:

A scalar is a 0D tensor (a single number)
A vector is a 1D tensor (a list)
A matrix is a 2D tensor (a grid)
An nD tensor extends to any number of axes

Why we need 3D (and beyond)

Training a model one example at a time is painfully slow. Modern GPUs have thousands of cores designed for parallel computation. To use them, we process data in batches - feeding dozens or hundreds of examples through the network simultaneously.

In a Transformer, the primary working tensor has three dimensions:

Batch Size (e.g., 32): how many independent sequences the GPU processes in parallel
Sequence Length (e.g., 512): how many tokens in each sequence
Embedding Dimension (e.g., 768): the size of each token's vector representation

A single forward pass through a Transformer processes $32 \times 512 \times 768 = 12.5$ million numbers simultaneously. This is only possible because GPUs are built for exactly this kind of parallel tensor math.

PyTorch handles tensor operations intelligently through broadcasting - automatically expanding smaller tensors to match larger ones when the shapes are compatible:

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# A batch of 32 MNIST images, each 28x28 pixels flattened to 784
batch = torch.randn(32, 784)
print(batch.shape)  # torch.Size([32, 784])

# Weight matrix: 784 inputs -> 128 outputs
W = torch.randn(784, 128)

# Matrix multiply processes ALL 32 images at once
output = batch @ W
print(output.shape)  # torch.Size([32, 128])

# Bias is a 1D tensor of 128 values - broadcasts across the batch
bias = torch.randn(128)
output = output + bias  # adds the same bias to all 32 outputs
print(output.shape)  # torch.Size([32, 128])

Now we have the machinery to describe a single computation unit: the neuron. But there's a problem we haven't addressed yet. Stacking linear transformations is useless - mathematically, any stack of matrix multiplications collapses to a single matrix multiplication. To learn anything interesting, we need nonlinearity.

The Neuron - A Dot Product with a Twist

From math to computation

A neuron computes the dot product of its inputs with its learned weights, adds a bias, and passes the result through a nonlinear activation function:

\text{output} = f\left(\sum_{i=1}^n w_i x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b)

The weights $\mathbf{w}$ determine what input pattern the neuron responds to. The bias $b$ shifts the activation threshold. The activation function $f$ is what separates a stack of matrix multiplications from something that can actually learn.

Why non-linearity matters: the XOR problem

Consider the XOR function: it outputs 1 when exactly one of two inputs is 1, and 0 otherwise.

$x_1$	$x_2$	XOR
0	0	0
0	1	1
1	0	1
1	1	0

A single neuron without activation computes $w_1 x_1 + w_2 x_2 + b$ - a straight line in 2D space. XOR cannot be separated by any straight line. The positive examples (0,1) and (1,0) sit on opposite sides of any line you could draw. No values of $w_1$ , $w_2$ , and $b$ can solve it.

Two neurons with a nonlinear activation, feeding into a third, can solve XOR. The first layer creates a new representation where the problem becomes linearly separable. This is the core insight of deep learning: each layer creates a new representation of the data that makes the problem easier for the next layer. It's also why the 1969 proof that single neurons can't learn XOR triggered a decade-long funding drought - the community hadn't yet realized that the fix was to stack them.

Activation functions

Without activation functions, stacking layers is pointless. A composition of linear functions is still linear: $f(x) = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$ - one matrix multiplication, regardless of depth. Activation functions break this collapse.

ReLU (Rectified Linear Unit) - $f(x) = \max(0, x)$

The simplest and most widely used. Positive values pass through unchanged; negative values become zero. ReLU solved the vanishing gradient problem that plagued sigmoid and tanh networks - its gradient is either 0 or 1, never a fraction that compounds to zero across many layers. The downside: "dead neurons." A neuron whose output is always negative has a permanently zero gradient and stops learning. Variants like Leaky ReLU ( $f(x) = \max(0.01x, x)$ ) and GELU ( $f(x) = x \cdot \Phi(x)$ , used in Transformers) address this.

Sigmoid - $f(x) = \frac{1}{1 + e^{-x}}$

Squashes any real number to $(0, 1)$ . Historically important but rarely used in hidden layers now because its gradient vanishes for large inputs ( $f'(x) \to 0$ as $|x| \to \infty$ ). Still used in the output layer for binary classification, where the output needs to be interpretable as a probability.

Tanh - $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

Like sigmoid but centered at zero, outputting in $(-1, 1)$ . The zero-centering helps gradient flow compared to sigmoid, but it still suffers from vanishing gradients at extremes.

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
import torch.nn as nn

# A single neuron: 3 inputs, 1 output
neuron = nn.Linear(3, 1)  # creates weights (3,) and bias (1,)

x = torch.tensor([[1.0, 0.5, -0.2]])

# Without activation: just a dot product + bias (linear)
z = neuron(x)
print(f"Linear output: {z.item():.4f}")

# With ReLU: max(0, z)
output = torch.relu(z)
print(f"After ReLU: {output.item():.4f}")

# Inspect the learned weights
print(f"Weights: {neuron.weight.data}")
print(f"Bias: {neuron.bias.data}")

A single neuron detects one pattern. To detect many patterns simultaneously, we run many neurons in parallel - that's a layer. To detect patterns of patterns, we stack layers. That's a deep network.

Deep Neural Networks - Stacking Layers

From one neuron to many

A layer of neurons computes many dot products in parallel - it detects many patterns simultaneously. A deep network stacks multiple layers, where each layer builds on the representations created by the previous one.

Each layer applies:

A matrix multiplication ( $W \cdot x$ ) - the weights
A bias addition ( $+ b$ )
An activation function ( $f(\cdot)$ ) - the nonlinearity

Mathematically: $\mathbf{h} = f(W\mathbf{x} + \mathbf{b})$ , where $\mathbf{h}$ is the layer's output ("hidden representation").

Counting parameters

Every connection between nodes is a weight. Every node in a hidden or output layer has a bias. For a network with layers $[3, 5, 5, 2]$ :

Layer 1 (input to hidden 1): $3 \times 5 + 5 = 20$ parameters
Layer 2 (hidden 1 to hidden 2): $5 \times 5 + 5 = 30$ parameters
Layer 3 (hidden 2 to output): $5 \times 2 + 2 = 12$ parameters
Total: 62 parameters

The general formula for a fully connected layer: $\text{params} = \text{inputs} \times \text{outputs} + \text{outputs}$

GPT-3 has 175 billion parameters. The structure is identical - weights and biases in layers. The scale is not.

Why depth works: hierarchical feature learning

Cybenko (1989) proved the universal approximation theorem: a network with a single hidden layer can approximate any continuous function given enough neurons. So why use depth at all?

Because depth is exponentially more efficient than width. A shallow network might need millions of neurons to learn a complex function. A deep network can learn the same function with far fewer parameters by building hierarchical representations:

Layer 1 detects edges and gradients in an image
Layer 2 combines edges into corners, curves, and textures
Layer 3 combines those into parts - eyes, noses, ears
Layer 4 combines parts into faces

Each layer reuses simpler features learned by the layer before it. Compositionality is what makes depth worth the training complexity.

The forward pass step by step

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# A minimal network: 2 inputs -> 3 hidden -> 2 outputs
layer1 = nn.Linear(2, 3)
layer2 = nn.Linear(3, 2)

# Override weights for demonstration
with torch.no_grad():
    layer1.weight.copy_(torch.tensor([[0.5, -0.3], [0.2, 0.8], [-0.4, 0.6]]))
    layer1.bias.copy_(torch.tensor([0.1, -0.1, 0.2]))
    layer2.weight.copy_(torch.tensor([[0.7, -0.2, 0.5], [-0.3, 0.9, 0.1]]))
    layer2.bias.copy_(torch.tensor([0.0, 0.1]))

x = torch.tensor([[1.0, 0.5]])

# Step 1: Linear transformation
z1 = x @ layer1.weight.T + layer1.bias
print(f"After layer1 linear: {z1}")  # [0.45, 0.50, 0.10]

# Step 2: Activation (ReLU)
h1 = torch.relu(z1)
print(f"After ReLU: {h1}")  # [0.45, 0.50, 0.10] (all positive, unchanged)

# Step 3: Second linear transformation
z2 = h1 @ layer2.weight.T + layer2.bias
print(f"After layer2 linear: {z2}")  # [0.27, 0.42]

# This is the forward pass. The network transformed [1.0, 0.5] into [0.27, 0.42].

Our MNIST model in PyTorch

MNIST images are $28 \times 28 = 784$ pixels. We classify them into 10 digits (0-9). A two-layer network is enough:

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import torch
import torch.nn as nn

class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)  # 784 pixels -> 128 hidden neurons
        self.layer2 = nn.Linear(128, 10)   # 128 hidden -> 10 digit classes

    def forward(self, x):
        x = x.view(-1, 784)             # flatten 28x28 image to a 784-vector
        x = torch.relu(self.layer1(x))   # layer 1: linear + ReLU
        x = self.layer2(x)               # layer 2: linear (raw logits)
        return x

model = DigitClassifier()

# Count parameters: 784*128 + 128 + 128*10 + 10 = 101,770
total_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total_params:,}")  # 101,770

# Test forward pass with a random image
fake_image = torch.randn(1, 1, 28, 28)
logits = model(fake_image)
print(f"Output shape: {logits.shape}")  # (1, 10) - one score per digit class
print(f"Raw logits: {logits.data}")     # 10 arbitrary numbers

The model has 101,770 learnable parameters. The forward method is two matrix multiplications with a ReLU between them - exactly the math we've been building up. But its output - 10 raw numbers called logits - is not yet interpretable as a probability. That requires one more function.

The Softmax Function

From logits to probabilities

Our network outputs 10 raw numbers per input. These can be any value: positive, negative, large, small. To interpret them as "the model thinks this is probably a 7," we need to convert them into a probability distribution: all values between 0 and 1, summing to exactly 1.

Softmax does this in two steps:

Exponentiate each value (making everything positive)
Divide by the total (normalizing to sum to 1)

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}

Why exponential?

The exponential function amplifies differences. If two logits differ by 2, their exponentials differ by a factor of $e^2 \approx 7.4$ . The largest logit dominates the probability distribution. The model becomes decisive.

This amplification is controlled by temperature. Dividing logits by a temperature $T$ before softmax adjusts how peaked the distribution is:

\text{softmax}(x_i; T) = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}}

$T \to 0$ : all probability on the largest logit (completely confident)
$T = 1$ : standard softmax
$T \to \infty$ : uniform distribution (completely uncertain)

The $1/\sqrt{d_k}$ scaling in Transformer attention is effectively a temperature parameter. Without it, dot products between high-dimensional vectors grow large enough that softmax outputs near-binary distributions, killing gradient flow during training.

Numerical stability

Computing $e^{1000}$ overflows to infinity. The fix: subtract the maximum logit before exponentiating.

\text{softmax}(x_i) = \frac{e^{x_i - \max(\mathbf{x})}}{\sum_j e^{x_j - \max(\mathbf{x})}}

This is mathematically identical (the max cancels out) but keeps all exponents in a reasonable range. PyTorch handles this automatically inside F.cross_entropy.

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Our model's raw output
logits = model(torch.randn(1, 1, 28, 28))
print(f"Raw logits: {logits.data.squeeze()}")

# Convert to probabilities
probs = torch.softmax(logits, dim=1)
print(f"Probabilities: {probs.data.squeeze()}")
print(f"Sum: {probs.sum().item():.6f}")  # 1.000000

# The model's prediction
predicted_digit = probs.argmax().item()
confidence = probs.max().item()
print(f"Prediction: digit {predicted_digit} with {confidence:.1%} confidence")
# Untrained model: roughly 10% confidence for each digit (random guessing)

The model now produces a probability distribution. The question is: how do we know how wrong it is? And how do we make it less wrong?

How Neural Networks Learn

The training problem

We have a model with 101,770 parameters (weights and biases). Currently they're random, so the model guesses randomly - about 10% accuracy on 10 classes. We need to find values for these parameters that produce correct predictions. But the search space has more configurations than atoms in the universe.

The solution: gradient descent. Instead of searching blindly, we start with random parameters and iteratively improve them by following the gradient of a loss function - a measure of how wrong we are.

Loss functions: measuring how wrong you are

A loss function takes the model's prediction and the true answer and outputs a single number measuring how wrong the prediction is. Lower is better. Zero means perfect.

For classification, the standard is cross-entropy loss:

\mathcal{L} = -\log(p_{\text{correct}})

where $p_{\text{correct}}$ is the probability the model assigns to the correct class.

The shape of $-\log(x)$ is what makes this work:

If $p = 0.99$ (very confident and correct): $\mathcal{L} = -\log(0.99) = 0.01$ - tiny loss
If $p = 0.5$ (uncertain): $\mathcal{L} = -\log(0.5) = 0.69$ - moderate loss
If $p = 0.01$ (very confident and wrong): $\mathcal{L} = -\log(0.01) = 4.6$ - enormous loss

Cross-entropy barely rewards correct predictions but severely punishes confident wrong ones. This asymmetry is what forces the model to learn calibrated probabilities rather than just picking the most common class.

For a randomly initialized 10-class model, each class gets roughly $p = 0.1$ , so the expected initial loss is $-\log(0.1) = \ln(10) \approx 2.3$ . If your initial loss is much higher or lower, something is wrong with your architecture or data pipeline.

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import torch.nn.functional as F

# Model predicts logits, true label is digit "3"
logits = model(torch.randn(1, 1, 28, 28))
target = torch.tensor([3])

loss = F.cross_entropy(logits, target)
print(f"Loss: {loss.item():.4f}")  # ~2.3 for random predictions

# What cross_entropy does internally:
# 1. Apply softmax to logits -> probabilities
# 2. Take the probability of the correct class
# 3. Take -log of that probability
probs = torch.softmax(logits, dim=1)
manual_loss = -torch.log(probs[0, 3])
print(f"Manual loss: {manual_loss.item():.4f}")  # same value

Gradient descent: rolling downhill

The loss is a function of all 101,770 parameters. Visualized, it's a landscape in high-dimensional space with hills (high loss) and valleys (low loss). We want the deepest valley.

Gradient descent does this by computing the gradient - the direction of steepest ascent in the loss landscape - and stepping in the opposite direction:

w_{t+1} = w_t - \eta \cdot \nabla \mathcal{L}(w_t)

where $\eta$ is the learning rate (step size) and $\nabla \mathcal{L}$ is the gradient of the loss with respect to the weights.

The learning rate is the most consequential hyperparameter in training:

Too large ( $\eta = 1.0$ ): steps overshoot the minimum. The loss oscillates or diverges.
Too small ( $\eta = 0.00001$ ): each step is tiny. Training takes forever and can get stuck in shallow local minima.
About right ( $\eta \approx 0.001$ to $0.01$ ): steady descent toward the minimum.

Stochastic gradient descent (SGD)

Computing the gradient over the entire 60,000-example training set is expensive. Stochastic Gradient Descent approximates the true gradient by computing it on a small random subset - a mini-batch of 32-128 examples:

w_{t+1} = w_t - \eta \cdot \nabla \mathcal{L}_\text{batch}(w_t)

The batch gradient is a noisy estimate of the true gradient. The noise helps: it can bounce the optimizer out of shallow local minima that full-batch gradient descent would get stuck in. One epoch is one full pass through the training set, shuffled into mini-batches.

Backpropagation: computing gradients efficiently

The gradient $\nabla \mathcal{L}$ has one component per parameter - 101,770 partial derivatives. Computing each one separately would require 101,770 forward passes. Backpropagation computes all of them in a single backward pass.

The key is the chain rule. If $y = f(g(x))$ :

\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}

A neural network is a composition of functions: $\text{loss} = L(\text{softmax}(W_2 \cdot \text{relu}(W_1 \cdot x + b_1) + b_2), y)$ . Backpropagation applies the chain rule to this composition, layer by layer, from output back to input. At each layer it computes:

The gradient of the loss with respect to the layer's output (received from the layer above)
The gradient of the layer's output with respect to its weights (computed locally - these are the weight updates we want)
The gradient of the layer's output with respect to its input (passed down to the layer below)

One backward pass computes all 101,770 gradients simultaneously. In PyTorch, this is a single call:

python

1
loss.backward()  # computes gradients for EVERY parameter in the model

After .backward(), each parameter tensor has a .grad attribute containing its partial derivative:

python

1
2
3
# Inspect gradients after a backward pass
for name, param in model.named_parameters():
    print(f"{name}: shape={param.shape}, grad_mean={param.grad.mean():.6f}")

The complete training loop

Put it together: forward pass, compute loss, backward pass, update weights. Four steps, repeated thousands of times.

python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

# ============================================================
# 1. Load MNIST dataset
# ============================================================
transform = transforms.ToTensor()  # converts images to tensors in [0, 1]
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)

# DataLoaders handle batching and shuffling
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=1000)

print(f"Training examples: {len(train_data):,}")   # 60,000
print(f"Test examples: {len(test_data):,}")         # 10,000
print(f"Image shape: {train_data[0][0].shape}")     # (1, 28, 28)
print(f"Batches per epoch: {len(train_loader)}")    # 938

# ============================================================
# 2. Define model (same as before)
# ============================================================
model = DigitClassifier()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# ============================================================
# 3. Optimizer: vanilla SGD with learning rate 0.01
# ============================================================
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# ============================================================
# 4. Training loop: 5 epochs
# ============================================================
for epoch in range(5):
    model.train()  # enable training mode
    total_loss = 0
    correct_train = 0

    for images, labels in train_loader:
        # FORWARD PASS: input -> prediction
        logits = model(images)

        # LOSS: how wrong are we?
        loss = F.cross_entropy(logits, labels)

        # BACKWARD PASS: compute gradients
        optimizer.zero_grad()  # clear gradients from previous batch
        loss.backward()        # compute new gradients via backpropagation

        # UPDATE: adjust weights
        optimizer.step()       # w = w - lr * grad

        # Track metrics
        total_loss += loss.item()
        correct_train += (logits.argmax(1) == labels).sum().item()

    # ========================================================
    # Evaluate on test set (no gradients needed)
    # ========================================================
    model.eval()
    correct_test = 0
    with torch.no_grad():
        for images, labels in test_loader:
            preds = model(images).argmax(dim=1)
            correct_test += (preds == labels).sum().item()

    train_acc = correct_train / len(train_data) * 100
    test_acc = correct_test / len(test_data) * 100
    avg_loss = total_loss / len(train_loader)

    print(f"Epoch {epoch+1}/5: loss={avg_loss:.4f}, "
          f"train_acc={train_acc:.1f}%, test_acc={test_acc:.1f}%")

Running this on a laptop (no GPU needed):

1
2
3
4
5
Epoch 1/5: loss=1.2847, train_acc=78.4%, test_acc=87.3%
Epoch 2/5: loss=0.5231, train_acc=89.1%, test_acc=91.2%
Epoch 3/5: loss=0.4012, train_acc=91.0%, test_acc=92.4%
Epoch 4/5: loss=0.3524, train_acc=92.2%, test_acc=93.1%
Epoch 5/5: loss=0.3198, train_acc=93.0%, test_acc=93.6%

From 10% to 93.6% in 30 seconds. Each epoch, the model processes all 60,000 training images in batches of 64, backpropagates through the chain rule, and updates the weights 938 times. The training accuracy (93.0%) and test accuracy (93.6%) are close - the model is generalizing, not memorizing. If the gap were large, that would indicate overfitting.

But 93.6% with vanilla SGD is just the start. Swapping one line of code - the optimizer - can push that to 97%+. The next post covers exactly that: how Adam and its variants improve on vanilla SGD by adapting the step size per parameter, and why that matters at scale.