A 7-billion-parameter language model is, at its core, a single mathematical operation repeated millions of times: multiply a vector by a matrix, add a bias, apply a nonlinearity. That's it. The gap between that simple operation and a model that writes code, reasons about philosophy, and translates between languages is almost entirely a question of scale and what you train it on.
Understanding that gap - really understanding it, not just accepting it - requires building from the bottom. So that's what we'll do. The running example is a network that classifies handwritten digits. It's small enough to train on a laptop CPU in 30 seconds, and every concept we introduce is something you can watch affect the training curve.
Vectors
Everything in deep learning is a vector. Not figuratively - literally. An MNIST image is a vector. A word is a vector. A sentence is a sequence of vectors. The entire state of a neural network at inference time is a chain of vector transformations.
A vector is an ordered list of numbers. A single number (a scalar) describes one quantity - 72 degrees Fahrenheit. Most things in the world need more. A map position needs two numbers. A color needs three. A word's meaning in a language model needs 768 or more, each dimension capturing some aspect of how that word is used in context.
Direction and magnitude
Vectors have two properties that matter. Direction encodes what the vector represents - which combination of features it emphasizes. Magnitude encodes how strongly.
The vector $(1, 0)$ points purely along the first axis, $(0, 1)$ purely along the second, and $(1, 1)$ points between them with magnitude $\sqrt{1^2 + 1^2} = \sqrt{2}$. The magnitude formula extends to any dimension:

$$\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$
The dimensionality trick
The math works identically in any number of dimensions. A dot product between two 768-dimensional vectors follows the same formula as between two 2D vectors - just more terms to sum. This is not a special property of neural networks; it's linear algebra. But it has a profound consequence: we can represent meaning, syntax, and world knowledge as points in a very high-dimensional space, and all the geometric intuitions we have in 2D still apply.
There is a useful counterintuitive fact about high-dimensional spaces: almost all pairs of random vectors are nearly perpendicular. In 768 dimensions there is a vast amount of room for vectors to be distinct from each other. This is why high-dimensional embeddings can encode so much information without interfering.
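This is easy to check empirically. The sketch below (my own illustration, not from the original) draws random vector pairs and compares their average absolute cosine similarity in 2 and 768 dimensions:

```python
import torch

torch.manual_seed(0)

def mean_abs_cosine(dim, n_pairs=1000):
    # Draw n_pairs of random Gaussian vectors and average |cos(angle)|.
    a = torch.randn(n_pairs, dim)
    b = torch.randn(n_pairs, dim)
    cos = torch.nn.functional.cosine_similarity(a, b, dim=1)
    return cos.abs().mean().item()

low_dim = mean_abs_cosine(2)     # 2D: vectors are often strongly aligned or opposed
high_dim = mean_abs_cosine(768)  # 768D: almost every pair is nearly perpendicular
```

In 2D the average $|\cos\theta|$ comes out around 0.64; in 768 dimensions it collapses to roughly 0.03 - random high-dimensional vectors really are almost always nearly perpendicular.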
The entire job of a neural network is to take one vector and transform it into another: pixels in, digit label out. But to transform a vector, you need an operation. That operation is the dot product.
Dot Products - Measuring Similarity
The most important operation
The dot product takes two vectors of the same length and produces a single number. Multiply corresponding elements, then sum:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

For example:

$$(1, 2, 3) \cdot (4, 5, 6) = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$$
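The same computation in plain Python, as a quick sketch:

```python
# Dot product: multiply corresponding elements, then sum.
a = [1, 2, 3]
b = [4, 5, 6]
dot = sum(x * y for x, y in zip(a, b))  # 1*4 + 2*5 + 3*6 = 32
```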
The geometric interpretation
The dot product has a geometric meaning that makes it indispensable:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$

where $\theta$ is the angle between the two vectors. The result:

- Same direction ($\theta = 0$, $\cos\theta = 1$): maximum positive value. The vectors agree completely.
- Perpendicular ($\theta = 90°$, $\cos\theta = 0$): zero. The vectors share no information.
- Opposite directions ($\theta = 180°$, $\cos\theta = -1$): maximum negative value. The vectors disagree.
This is why the dot product is the foundation of attention in Transformers: it measures how similar two vectors are. The query vector "asks a question" and the key vector "offers an answer." Their dot product measures how well the answer matches the question.
Cosine similarity
If you care about direction but not magnitude, normalize by dividing out the magnitudes:

$$\text{cosine similarity} = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \cos\theta$$
This is used extensively in NLP to compare word embeddings. "King" and "queen" have high cosine similarity because they point in similar directions in embedding space, even if they differ in magnitude.
Every neuron in a neural network computes a dot product. A neuron with weights $\mathbf{w}$ receiving inputs $\mathbf{x}$ computes $\mathbf{w} \cdot \mathbf{x}$. The neuron is measuring how similar the input is to its learned weight pattern. Inputs that match produce large outputs; inputs that don't match produce small ones.
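A toy illustration of this pattern-matching view (the weights and inputs below are hand-picked for the example, not learned):

```python
import torch

# A neuron's weights act as a template: inputs that align with the
# weight vector produce large pre-activations, opposing inputs small ones.
w = torch.tensor([1.0, -1.0, 0.5])
b = 0.1
x_match = torch.tensor([2.0, -2.0, 1.0])   # points the same way as w
x_clash = torch.tensor([-2.0, 2.0, -1.0])  # points the opposite way

z_match = torch.dot(w, x_match) + b  # 2 + 2 + 0.5 + 0.1 = 4.6
z_clash = torch.dot(w, x_clash) + b  # -2 - 2 - 0.5 + 0.1 = -4.4
```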
A single dot product produces a single number: one neuron's response to one input. To run an entire layer of neurons at once, we need to compute many dot products simultaneously. That's matrix multiplication.
Matrix Multiplication
Many dot products at once
A matrix is a 2D grid of numbers. Matrix multiplication is the operation that makes neural networks computationally feasible: it computes many dot products in parallel.
When you multiply matrix $A$ (shape $m \times n$) by matrix $B$ (shape $n \times p$), each element of the result is the dot product of one row from $A$ with one column from $B$. The result has shape $m \times p$.

The dimension rule: $(m \times n) \times (n \times p) \to (m \times p)$. The inner dimension must match - that's the shared length of each dot product.
Every layer of a neural network is fundamentally a matrix multiplication. A layer with 784 inputs and 128 outputs has a weight matrix of shape $784 \times 128$. Feed in one input vector and you compute 128 dot products simultaneously - one per output neuron.

The real power comes from batching. Stack 64 input vectors into a matrix of shape $64 \times 784$, multiply by the weight matrix ($784 \times 128$), and get a $64 \times 128$ result - all 64 inputs processed through all 128 neurons in a single operation. This is why GPUs, built for parallel matrix math, matter so much. A forward pass isn't a loop over examples; it's one matrix multiply.
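In PyTorch the whole batched layer is one line (shapes here match the layer just described):

```python
import torch

batch = torch.randn(64, 784)  # 64 flattened MNIST images
W = torch.randn(784, 128)     # one layer's weights: 784 inputs -> 128 neurons

out = batch @ W               # 64 x 128: every image through every neuron at once
shape = tuple(out.shape)
```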
Multiplying a vector by a matrix doesn't just scale it - it transforms it geometrically: rotating, projecting, or compressing it. A $784 \times 128$ matrix transforms 784-dimensional space into 128-dimensional space. Each layer of a neural network applies a different transformation. The network learns a sequence of transformations that, composed together, map pixel space to digit-label space.
To compose many layers efficiently, we need one more abstraction: tensors.
Tensors - Thinking Beyond 2D
The generalization of matrices
In real deep learning code you won't see "matrix" or "vector." You'll see tensor - a generalization to any number of dimensions:
- A scalar is a 0D tensor (a single number)
- A vector is a 1D tensor (a list)
- A matrix is a 2D tensor (a grid)
- An nD tensor extends to any number of axes
Why we need 3D (and beyond)
Training a model one example at a time is painfully slow. Modern GPUs have thousands of cores designed for parallel computation. To use them, we process data in batches - feeding dozens or hundreds of examples through the network simultaneously.
In a Transformer, the primary working tensor has three dimensions:
- Batch Size (e.g., 32): how many independent sequences the GPU processes in parallel
- Sequence Length (e.g., 512): how many tokens in each sequence
- Embedding Dimension (e.g., 768): the size of each token's vector representation
A single forward pass through a Transformer processes $32 \times 512 \times 768 \approx 12.6$ million numbers simultaneously. This is only possible because GPUs are built for exactly this kind of parallel tensor math.
PyTorch handles tensor operations intelligently through broadcasting - automatically expanding smaller tensors to match larger ones when the shapes are compatible:
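For instance, adding a bias vector to a batch of activations - a small sketch, with shapes chosen to match the layer example above:

```python
import torch

# Broadcasting: a (64, 128) activation matrix plus a (128,) bias vector.
# PyTorch implicitly expands the bias across the batch dimension.
acts = torch.zeros(64, 128)
bias = torch.arange(128, dtype=torch.float32)

out = acts + bias  # bias is added to every one of the 64 rows
rows_match = torch.equal(out[0], bias) and torch.equal(out[63], bias)
```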
Now we have the machinery to describe a single computation unit: the neuron. But there's a problem we haven't addressed yet. Stacking linear transformations is useless - mathematically, any stack of matrix multiplications collapses to a single matrix multiplication. To learn anything interesting, we need nonlinearity.
The Neuron - A Dot Product with a Twist
From math to computation
A neuron computes the dot product of its inputs with its learned weights, adds a bias, and passes the result through a nonlinear activation function:

$$y = f(\mathbf{w} \cdot \mathbf{x} + b)$$
The weights determine what input pattern the neuron responds to. The bias shifts the activation threshold. The activation function is what separates a stack of matrix multiplications from something that can actually learn.
Why non-linearity matters: the XOR problem
Consider the XOR function: it outputs 1 when exactly one of two inputs is 1, and 0 otherwise.
| $x_1$ | $x_2$ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
A single neuron without activation computes $w_1 x_1 + w_2 x_2 + b$ - a straight-line decision boundary in 2D space. XOR cannot be separated by any straight line. The positive examples $(0,1)$ and $(1,0)$ sit on one diagonal, the negative examples $(0,0)$ and $(1,1)$ on the other; no line puts both positives on one side and both negatives on the other. No values of $w_1$, $w_2$, and $b$ can solve it.
Two neurons with a nonlinear activation, feeding into a third, can solve XOR. The first layer creates a new representation where the problem becomes linearly separable. This is the core insight of deep learning: each layer creates a new representation of the data that makes the problem easier for the next layer. It's also why the 1969 proof that single neurons can't learn XOR triggered a decade-long funding drought - the community hadn't yet realized that the fix was to stack them.
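Here is a hand-built sketch of that two-neuron solution (my own illustration - the weights are picked by hand, not learned):

```python
import torch
import torch.nn.functional as F

def xor_net(x1, x2):
    # Two hidden ReLU neurons feeding a third, linear output neuron.
    x = torch.tensor([x1, x2], dtype=torch.float32)
    h1 = F.relu(x[0] + x[1])        # OR-like detector: fires if either input is on
    h2 = F.relu(x[0] + x[1] - 1.0)  # AND-like detector: fires only if both are on
    return int(h1 - 2 * h2)         # OR minus twice AND = XOR

outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

In the hidden layer's new representation $(h_1, h_2)$, the four points become linearly separable - exactly the re-representation trick the text describes.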
Activation functions
Without activation functions, stacking layers is pointless. A composition of linear functions is still linear: $W_2(W_1 \mathbf{x}) = (W_2 W_1)\mathbf{x}$ - one matrix multiplication, regardless of depth. Activation functions break this collapse.
ReLU (Rectified Linear Unit) - $\text{ReLU}(x) = \max(0, x)$

The simplest and most widely used. Positive values pass through unchanged; negative values become zero. ReLU solved the vanishing gradient problem that plagued sigmoid and tanh networks - its gradient is either 0 or 1, never a fraction that compounds to zero across many layers. The downside: "dead neurons." A neuron whose pre-activation is always negative has a permanently zero gradient and stops learning. Variants like Leaky ReLU ($\max(0.01x, x)$) and GELU ($x \cdot \Phi(x)$, used in Transformers) address this.
Sigmoid - $\sigma(x) = \frac{1}{1 + e^{-x}}$

Squashes any real number into $(0, 1)$. Historically important but rarely used in hidden layers now because its gradient vanishes for large inputs ($\sigma'(x) \to 0$ as $|x| \to \infty$). Still used in the output layer for binary classification, where the output needs to be interpretable as a probability.
Tanh - $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

Like sigmoid but centered at zero, outputting in $(-1, 1)$. The zero-centering helps gradient flow compared to sigmoid, but it still suffers from vanishing gradients at extremes.
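A quick side-by-side of the three activations in PyTorch:

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])

relu = torch.relu(x)    # negatives clipped to zero: [0, 0, 2]
sig = torch.sigmoid(x)  # squashed into (0, 1), crossing 0.5 at x = 0
tanh = torch.tanh(x)    # squashed into (-1, 1), zero-centered: tanh(0) = 0
```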
A single neuron detects one pattern. To detect many patterns simultaneously, we run many neurons in parallel - that's a layer. To detect patterns of patterns, we stack layers. That's a deep network.
Deep Neural Networks - Stacking Layers
From one neuron to many
A layer of neurons computes many dot products in parallel - it detects many patterns simultaneously. A deep network stacks multiple layers, where each layer builds on the representations created by the previous one.
Each layer applies:
- A matrix multiplication ($W\mathbf{x}$) - the weights
- A bias addition ($+ \mathbf{b}$)
- An activation function ($f$) - the nonlinearity

Mathematically: $\mathbf{h} = f(W\mathbf{x} + \mathbf{b})$, where $\mathbf{h}$ is the layer's output ("hidden representation").
Counting parameters
Every connection between nodes is a weight. Every node in a hidden or output layer has a bias. For a network with layer sizes $n_0 \to n_1 \to n_2 \to n_3$:

- Layer 1 (input to hidden 1): $n_0 n_1 + n_1$ parameters
- Layer 2 (hidden 1 to hidden 2): $n_1 n_2 + n_2$ parameters
- Layer 3 (hidden 2 to output): $n_2 n_3 + n_3$ parameters
- Total: the sum of all three - 62 parameters in this small example
The general formula for a fully connected layer:

$$\text{parameters} = (n_{\text{in}} \times n_{\text{out}}) + n_{\text{out}}$$
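As a sanity check in code - a small sketch using the MNIST layer sizes that appear later in this post:

```python
# Parameters in a fully connected layer: n_in * n_out weights + n_out biases.
def layer_params(n_in, n_out):
    return n_in * n_out + n_out

# The MNIST model described below: 784 inputs -> 128 hidden -> 10 outputs.
total = layer_params(784, 128) + layer_params(128, 10)  # 101,770
```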
GPT-3 has 175 billion parameters. The structure is identical - weights and biases in layers. The scale is not.
Why depth works: hierarchical feature learning
Cybenko (1989) proved the universal approximation theorem: a network with a single hidden layer can approximate any continuous function given enough neurons. So why use depth at all?
Because depth is exponentially more efficient than width. A shallow network might need millions of neurons to learn a complex function. A deep network can learn the same function with far fewer parameters by building hierarchical representations:
- Layer 1 detects edges and gradients in an image
- Layer 2 combines edges into corners, curves, and textures
- Layer 3 combines those into parts - eyes, noses, ears
- Layer 4 combines parts into faces
Each layer reuses simpler features learned by the layer before it. Compositionality is what makes depth worth the training complexity.
The forward pass step by step
Our MNIST model in PyTorch
MNIST images are $28 \times 28$ pixels - 784 values when flattened. We classify them into 10 digits (0-9). A two-layer network is enough:
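A sketch consistent with the description - 784 inputs, a 128-unit hidden layer with ReLU, 10 output logits; the class and attribute names here are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MNISTClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)  # 784 pixels -> 128 hidden neurons
        self.fc2 = nn.Linear(128, 10)   # 128 hidden -> 10 digit logits

    def forward(self, x):
        x = x.view(x.size(0), -1)       # flatten 28x28 images into 784-vectors
        return self.fc2(F.relu(self.fc1(x)))

model = MNISTClassifier()
n_params = sum(p.numel() for p in model.parameters())  # 101,770
```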
The model has 101,770 learnable parameters. The forward method is two matrix multiplications with a ReLU between them - exactly the math we've been building up. But its output - 10 raw numbers called logits - is not yet interpretable as a probability. That requires one more function.
The Softmax Function
From logits to probabilities
Our network outputs 10 raw numbers per input. These can be any value: positive, negative, large, small. To interpret them as "the model thinks this is probably a 7," we need to convert them into a probability distribution: all values between 0 and 1, summing to exactly 1.
Softmax does this in two steps:

- Exponentiate each value (making everything positive)
- Divide by the total (normalizing to sum to 1)

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
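The two steps, written out directly and checked against PyTorch's built-in:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])

exp = torch.exp(logits)   # step 1: exponentiate (everything positive)
probs = exp / exp.sum()   # step 2: normalize (sums to 1)

total = probs.sum().item()
matches_builtin = torch.allclose(probs, torch.softmax(logits, dim=0))
```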
Why exponential?
The exponential function amplifies differences. If two logits differ by 2, their exponentials differ by a factor of $e^2 \approx 7.4$. The largest logit dominates the probability distribution. The model becomes decisive.
This amplification is controlled by temperature. Dividing logits by a temperature $T$ before softmax adjusts how peaked the distribution is:

- $T \to 0$: all probability on the largest logit (completely confident)
- $T = 1$: standard softmax
- $T \to \infty$: uniform distribution (completely uncertain)
The $1/\sqrt{d_k}$ scaling in Transformer attention is effectively a temperature parameter. Without it, dot products between high-dimensional vectors grow large enough that softmax outputs near-binary distributions, killing gradient flow during training.
Numerical stability
Computing $e^z$ for a large logit (say $z = 1000$) overflows to infinity in floating point. The fix: subtract the maximum logit before exponentiating.

$$\text{softmax}(z_i) = \frac{e^{z_i - \max_j z_j}}{\sum_{k} e^{z_k - \max_j z_j}}$$
This is mathematically identical (the max cancels out) but keeps all exponents in a reasonable range. PyTorch handles this automatically inside F.cross_entropy.
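A sketch of the trick, showing the naive version failing where the shifted version succeeds:

```python
import torch

def naive_softmax(z):
    e = torch.exp(z)            # exp(1000) overflows to inf
    return e / e.sum()          # inf / inf = nan

def stable_softmax(z):
    e = torch.exp(z - z.max())  # shift so the largest exponent is e^0 = 1
    return e / e.sum()

big = torch.tensor([1000.0, 999.0])
naive_is_nan = torch.isnan(naive_softmax(big)).any().item()
stable = stable_softmax(big)    # a valid distribution, roughly [0.73, 0.27]
```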
The model now produces a probability distribution. The question is: how do we know how wrong it is? And how do we make it less wrong?
How Neural Networks Learn
The training problem
We have a model with 101,770 parameters (weights and biases). Currently they're random, so the model guesses randomly - about 10% accuracy on 10 classes. We need to find values for these parameters that produce correct predictions. But the search space has more configurations than atoms in the universe.
The solution: gradient descent. Instead of searching blindly, we start with random parameters and iteratively improve them by following the gradient of a loss function - a measure of how wrong we are.
Loss functions: measuring how wrong you are
A loss function takes the model's prediction and the true answer and outputs a single number measuring how wrong the prediction is. Lower is better. Zero means perfect.
For classification, the standard is cross-entropy loss:

$$L = -\log(p_{\text{correct}})$$

where $p_{\text{correct}}$ is the probability the model assigns to the correct class.
The shape of $-\log(p)$ is what makes this work:

- If $p = 0.99$ (very confident and correct): $-\log(0.99) \approx 0.01$ - tiny loss
- If $p = 0.5$ (uncertain): $-\log(0.5) \approx 0.69$ - moderate loss
- If $p = 0.01$ (very confident and wrong): $-\log(0.01) \approx 4.6$ - enormous loss
Cross-entropy barely rewards correct predictions but severely punishes confident wrong ones. This asymmetry is what forces the model to learn calibrated probabilities rather than just picking the most common class.
For a randomly initialized 10-class model, each class gets roughly $p = 0.1$, so the expected initial loss is $-\ln(0.1) \approx 2.3$. If your initial loss is much higher or lower, something is wrong with your architecture or data pipeline.
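This sanity check is one line:

```python
import math

# A randomly initialized 10-class model assigns roughly p = 0.1 to the
# correct class, so the expected starting cross-entropy is -ln(0.1).
expected_initial_loss = -math.log(0.1)  # about 2.303
```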
Gradient descent: rolling downhill
The loss is a function of all 101,770 parameters. Visualized, it's a landscape in high-dimensional space with hills (high loss) and valleys (low loss). We want the deepest valley.
Gradient descent does this by computing the gradient - the direction of steepest ascent in the loss landscape - and stepping in the opposite direction:

$$w \leftarrow w - \eta \, \nabla_w L$$

where $\eta$ is the learning rate (step size) and $\nabla_w L$ is the gradient of the loss with respect to the weights.
The learning rate is the most consequential hyperparameter in training:
- Too large (e.g., $\eta = 1$): steps overshoot the minimum. The loss oscillates or diverges.
- Too small (e.g., $\eta = 10^{-6}$): each step is tiny. Training takes forever and can get stuck in shallow local minima.
- About right (roughly $10^{-3}$ to $10^{-1}$ for SGD): steady descent toward the minimum.
Stochastic gradient descent (SGD)
Computing the gradient over the entire 60,000-example training set is expensive. Stochastic Gradient Descent approximates the true gradient by computing it on a small random subset - a mini-batch $B$ of 32-128 examples:

$$\nabla_w L \approx \frac{1}{|B|} \sum_{i \in B} \nabla_w L_i$$
The batch gradient is a noisy estimate of the true gradient. The noise helps: it can bounce the optimizer out of shallow local minima that full-batch gradient descent would get stuck in. One epoch is one full pass through the training set, shuffled into mini-batches.
Backpropagation: computing gradients efficiently
The gradient has one component per parameter - 101,770 partial derivatives. Computing each one separately would require 101,770 forward passes. Backpropagation computes all of them in a single backward pass.
The key is the chain rule. If $y = f(g(x))$, then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

A neural network is a composition of functions: $\hat{y} = f_3(f_2(f_1(\mathbf{x})))$. Backpropagation applies the chain rule to this composition, layer by layer, from output back to input. At each layer it computes:
- The gradient of the loss with respect to the layer's output (received from the layer above)
- The gradient of the layer's output with respect to its weights (computed locally - these are the weight updates we want)
- The gradient of the layer's output with respect to its input (passed down to the layer below)
One backward pass computes all 101,770 gradients simultaneously. In PyTorch, this is a single call: `loss.backward()`.
After .backward(), each parameter tensor has a .grad attribute containing its partial derivative:
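A minimal sketch, with a single linear layer and fake data standing in for the full model:

```python
import torch
import torch.nn.functional as F

layer = torch.nn.Linear(784, 10)
x = torch.randn(32, 784)              # a mini-batch of 32 fake "images"
target = torch.randint(0, 10, (32,))  # fake labels

loss = F.cross_entropy(layer(x), target)
loss.backward()                       # one pass fills every parameter's .grad

grad_shape = tuple(layer.weight.grad.shape)  # same shape as the weight itself
```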
The complete training loop
Put it together: forward pass, compute loss, backward pass, update weights. Four steps, repeated thousands of times.
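The four steps can be sketched as follows - a synthetic batch stands in for the MNIST DataLoader, and the names are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Synthetic stand-in for one mini-batch; real code iterates a DataLoader.
torch.manual_seed(0)
images = torch.randn(64, 784)
labels = torch.randint(0, 10, (64,))

losses = []
for step in range(20):
    logits = model(images)                  # 1. forward pass
    loss = F.cross_entropy(logits, labels)  # 2. compute loss
    optimizer.zero_grad()                   #    clear stale gradients
    loss.backward()                         # 3. backward pass
    optimizer.step()                        # 4. update weights
    losses.append(loss.item())
```

Even on this synthetic batch, the loss falls steadily from its random-initialization value of roughly 2.3 as the model memorizes the batch.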
Running this on a laptop (no GPU needed) takes the model from 10% accuracy to 93.6% in about 30 seconds. Each epoch, the model processes all 60,000 training images in batches of 64, backpropagates through the chain rule, and updates the weights 938 times. The training accuracy (93.0%) and test accuracy (93.6%) are close - the model is generalizing, not memorizing. If the gap were large, that would indicate overfitting.
But 93.6% with vanilla SGD is just the start. Swapping one line of code - the optimizer - can push that to 97%+. The next post covers exactly that: how Adam and its variants improve on vanilla SGD by adapting the step size per parameter, and why that matters at scale.
