Fine-tuning is how we take a general-purpose language model and make it yours - trained on your data, for your task. But there's a problem. A big one.
A 7-billion parameter model in fp16 takes 14 GB just to load the weights. Full fine-tuning requires storing gradients (another 14 GB) and optimizer states (Adam keeps two fp32 copies per parameter - that's 56 GB). Add it up: fine-tuning a 7B model needs roughly 84 GB of GPU memory. That's more than a single A100. For a 70B model? You need a cluster.
This is absurd. We want to teach the model a new task - maybe classify customer support tickets, or write code in a particular style. Do we really need to update all 7 billion parameters?
The answer, as it turns out, is no. Not even close.
The Fine-Tuning Dilemma
Let's break down where the memory goes during full fine-tuning. For a model with $N$ parameters:

| Component | Memory (fp16/fp32) | 7B Model |
|---|---|---|
| Model weights | $2N$ bytes (fp16) | 14 GB |
| Gradients | $2N$ bytes (fp16) | 14 GB |
| Adam optimizer (momentum) | $4N$ bytes (fp32) | 28 GB |
| Adam optimizer (variance) | $4N$ bytes (fp32) | 28 GB |
| Total | $12N$ bytes | 84 GB |
The optimizer states alone consume 4x the model size. This is because Adam (and AdamW, which we covered in the optimizers post) maintains two running averages per parameter in full precision (fp32).
And this is just the static memory. During the forward pass, you also store activations for backpropagation - that's additional memory proportional to batch size and sequence length.
The chart above tells the story. Select different model sizes and watch the bars. Full fine-tuning scales linearly and brutally. A 65B model needs over 780 GB - roughly ten A100-80GB GPUs, with 520 GB going to optimizer states alone.
But look at LoRA and QLoRA. The bars barely grow. A 65B model with QLoRA fits on a single 48GB GPU. How?
The Low-Rank Hypothesis
What is matrix rank?
The rank of a matrix is the number of linearly independent rows (or equivalently, columns). It tells you the "true dimensionality" of the information the matrix encodes.
An $m \times n$ matrix has at most rank $\min(m, n)$. But many real-world matrices have much lower effective rank - most of their information concentrates in a few dimensions, with the rest being noise or redundancy.
Consider the Singular Value Decomposition (SVD). Any matrix $W \in \mathbb{R}^{m \times n}$ can be decomposed as:

$$W = U \Sigma V^\top$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$.
What does each piece mean? Think of any matrix as a transformation that acts on vectors. SVD breaks that transformation into three simpler steps:
- $V^\top$ (Rotate): First, rotate the input into a new coordinate system that aligns with the "natural axes" of the transformation — the directions along which the matrix acts most cleanly.
- $\Sigma$ (Scale): Then, stretch or shrink along each of those axes. The singular values $\sigma_i$ are exactly these stretch factors, sorted from largest to smallest.
- $U$ (Rotate again): Finally, rotate the result into the output coordinate system.

The singular values tell you how much "energy" each dimension carries. A large $\sigma_i$ means that dimension contributes a lot to the matrix's overall effect; a tiny $\sigma_i$ means that dimension is almost negligible.
A concrete example. Take this small diagonal matrix:

$$W = \begin{pmatrix} 9 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 0.1 \end{pmatrix}$$

This is already diagonal, so the SVD is trivial: $U = V = I$ (the identity matrix) and $\Sigma = W$ itself. The singular values are $\sigma_1 = 9$, $\sigma_2 = 4$, $\sigma_3 = 0.1$. Notice that the third singular value is tiny — it contributes almost nothing. If we set it to zero, we get a rank-2 approximation that is nearly identical to the original:

$$W_2 = \begin{pmatrix} 9 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

We just went from 9 parameters to describing the matrix with only 2 meaningful values — and barely lost any information.
Of course, real matrices aren't diagonal. But that's exactly what SVD does: it finds the right rotations ($U$ and $V$) so that after rotating, the matrix becomes diagonal ($\Sigma$). Then you can look at the diagonal entries and decide which ones actually matter.
Why does this help? If the singular values drop off quickly — say $\sigma_1 \gg \sigma_{k+1}$ for some small $k$ — then the matrix is well-approximated by keeping only the top $k$ components:

$$W \approx W_k = U_k \Sigma_k V_k^\top$$

Here $U_k$ is the first $k$ columns of $U$, $\Sigma_k$ is the top-left $k \times k$ block of $\Sigma$, and $V_k^\top$ is the first $k$ rows of $V^\top$. This is known as the Eckart–Young theorem: this rank-$k$ matrix is the best possible rank-$k$ approximation to $W$ (in terms of minimizing the Frobenius norm of the error).

This rank-$k$ approximation uses $k(m + n)$ parameters instead of $mn$. For $m = n = 1024$ and $k = 4$, that's $8{,}192$ instead of $1{,}048{,}576$ - a 128x reduction.
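Here's a small numpy experiment that makes this concrete. It is a sketch with a synthetic matrix whose spectrum decays rapidly; truncating the SVD at rank 4 reconstructs the matrix almost perfectly while storing 128x fewer numbers:

```python
import numpy as np

# Build a 1024x1024 matrix with a rapidly decaying spectrum, then
# check how well the rank-4 truncated SVD reconstructs it.
rng = np.random.default_rng(0)
m = n = 1024
U, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthogonal U
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal V
sigma = 10.0 ** -np.arange(n, dtype=float)         # sigma_i = 10^-i
W = U @ np.diag(sigma) @ V.T

k = 4
Wk = U[:, :k] @ np.diag(sigma[:k]) @ V[:, :k].T    # rank-k approximation

rel_err = np.linalg.norm(W - Wk) / np.linalg.norm(W)
params_full = m * n                                # 1,048,576
params_lowrank = k * (m + n)                       # 8,192
print(rel_err, params_full // params_lowrank)
```

With this spectrum the relative error is on the order of $10^{-4}$ while the parameter count drops by a factor of 128.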
The interactive visualization below lets you see this in action. Drag the slider to change the rank and watch how the approximation converges to the original:
Notice something remarkable: even at very low ranks (5-10), the approximation captures the main structure. The cross pattern, the circular ring, the smooth gradients - all the meaningful information lives in a low-dimensional subspace. The high-rank components are just noise.
Aghajanyan et al.: Intrinsic Dimensionality
In 2021, Aghajanyan, Gupta, and Zettlemoyer published a paper called "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." Their key finding:
Pre-trained language models have a very low intrinsic dimensionality for downstream tasks.
What does this mean? They projected gradient updates into random low-dimensional subspaces and found that fine-tuning worked almost as well. A RoBERTa model with 355 million parameters could be fine-tuned effectively in a subspace of dimension 200 - that's 0.00006% of the full parameter space.
This makes intuitive sense. Pre-training already teaches the model a rich representation of language. Fine-tuning for a specific task is a small adjustment to that representation - moving through a low-dimensional manifold in parameter space, not exploring the full $N$-dimensional parameter landscape.
The weight change $\Delta W = W_{\text{finetuned}} - W_{\text{pretrained}}$ has low intrinsic rank. This is the key insight that makes LoRA possible.
LoRA: Low-Rank Adaptation
The core idea
LoRA (Hu et al., 2021) makes the low-rank hypothesis explicit and practical. Instead of updating a weight matrix $W \in \mathbb{R}^{d \times k}$ directly, we decompose the update into two smaller matrices:

$$W' = W + \Delta W = W + BA$$

where:
- $B \in \mathbb{R}^{d \times r}$ (the "up-projection")
- $A \in \mathbb{R}^{r \times k}$ (the "down-projection")
- $r \ll \min(d, k)$ is the rank

The original weight $W$ stays frozen - we never compute gradients for it, never store optimizer states for it. Only $A$ and $B$ are trainable.
Drag the rank slider above. At rank $r$, we need $r(d + k)$ trainable parameters instead of $dk$ - a 4x reduction in this toy example. At real model scales ($d = k = 4096$, $r = 8$), the ratio is 256x.
Parameter savings math
For a single weight matrix $W \in \mathbb{R}^{d \times k}$:

| Method | Trainable params |
|---|---|
| Full fine-tuning | $dk$ |
| LoRA (rank $r$) | $r(d + k)$ |
| Savings | $\frac{dk}{r(d + k)}$ times fewer |

For a square matrix ($d = k = n$), the savings simplify to $\frac{n}{2r}$. With $n = 4096$ and $r = 16$:

$$\frac{n}{2r} = \frac{4096}{2 \times 16} = 128\times \text{ fewer parameters}$$
In a transformer model, LoRA is typically applied to the attention projection matrices (Q, K, V, O). A 7B model like LLaMA-2-7B has 32 transformer layers, each with 4 projection matrices of size $4096 \times 4096$. With rank $r = 16$:

$$32 \times 4 \times 16 \times (4096 + 4096) = 16{,}777{,}216 \approx 16.8\text{M}$$
That's 16.8 million trainable parameters out of 7 billion total - roughly 0.24% of the model. And yet this achieves performance competitive with full fine-tuning on many tasks.
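You can verify the arithmetic directly:

```python
# Trainable-parameter count for LoRA (r=16) on the attention
# projections of a LLaMA-2-7B-shaped model.
n_layers = 32        # transformer layers
n_proj = 4           # Q, K, V, O projections per layer
d = 4096             # hidden dimension (each projection is d x d)
r = 16               # LoRA rank

lora_params = n_layers * n_proj * r * (d + d)   # r(d + k) per matrix
total_params = 7_000_000_000
print(lora_params, lora_params / total_params)  # ~16.8M, ~0.24%
```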
The forward pass
During the forward pass, both paths run in parallel:

$$h = Wx + \frac{\alpha}{r} BAx$$

The input $x$ flows through two paths:
- Frozen path: $Wx$ - the original pretrained computation
- LoRA path: $BAx$ - the learned adaptation, scaled by $\frac{\alpha}{r}$

The results are summed to produce the output $h$.
The scaling factor $\frac{\alpha}{r}$ is important. It controls the magnitude of the LoRA update relative to the original weights. The parameter $\alpha$ is a hyperparameter (typically set to 16 or 32). Dividing by $r$ means that when you increase the rank, each individual component contributes less - the total update magnitude stays approximately constant across different rank choices. This makes the update scale roughly invariant to the choice of $r$.
Initialization
LoRA uses a specific initialization scheme that ensures the model starts exactly where pre-training left off:
- $A$ is initialized with random Gaussian values (small, like Kaiming init)
- $B$ is initialized to all zeros

This means $\Delta W = BA = 0$ at the start of training. The model's output is initially identical to the pre-trained model. Training then gradually learns the low-rank update $BA$.
This is crucial for stability. You're not perturbing the pretrained model at initialization - you're starting from exactly the pretrained weights and smoothly moving toward the fine-tuned solution.
Let's verify the zero-initialization property:
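A minimal numpy sketch (not the PEFT implementation): with $B$ initialized to zero, the adapted layer's output exactly matches the frozen layer's output.

```python
import numpy as np

# LoRA initialization: A ~ small Gaussian, B = 0, so the update BA
# starts at exactly zero and the adapted layer matches the base layer.
rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d, k))            # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01     # down-projection: random init
B = np.zeros((d, r))                       # up-projection: zero init

x = rng.standard_normal(k)
h_base = W @ x
h_lora = W @ x + (alpha / r) * (B @ (A @ x))
print(np.allclose(h_base, h_lora))         # True: identical at init
```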
Where to Apply LoRA
Transformer weight matrices
A standard transformer layer has several weight matrices. Which ones should get LoRA adapters?
Attention projections:
- $W_Q$ (query projection): $d \times d$
- $W_K$ (key projection): $d \times d$
- $W_V$ (value projection): $d \times d$
- $W_O$ (output projection): $d \times d$

Feed-forward network (FFN):
- $W_{\text{up}}$ (up-projection): $d \times d_{\text{ff}}$ (typically $d_{\text{ff}} = 4d$)
- $W_{\text{down}}$ (down-projection): $d_{\text{ff}} \times d$
- $W_{\text{gate}}$ (gate projection, in gated FFNs like LLaMA): $d \times d_{\text{ff}}$
What the original paper found
Hu et al. (2021) experimented with different combinations on GPT-3 175B. Their findings:
| LoRA applied to | Params | WikiSQL Acc | MultiNLI Acc |
|---|---|---|---|
| $W_Q$ only | 4.7M | 73.4 | 91.7 |
| $W_K$ only | 4.7M | 73.2 | 91.3 |
| $W_V$ only | 4.7M | 73.8 | 91.7 |
| $W_Q, W_V$ | 9.4M | 74.4 | 91.7 |
| $W_Q, W_K, W_V, W_O$ | 18.9M | 74.6 | 91.8 |
Key takeaways:
- $W_Q$ and $W_V$ together give most of the benefit. Adding $W_K$ and $W_O$ helps only marginally.
- More matrices with lower rank can outperform fewer matrices with higher rank, given the same total parameter budget.
- Even a single matrix ($W_Q$ or $W_V$) gets remarkably close to the best result.
Modern practice
In current practice (2024-2026), most practitioners apply LoRA to all linear layers - attention projections plus FFN weights. The HuggingFace PEFT library makes this trivial. Here's why the trend shifted:
- FFN weights store factual knowledge. The attention layers learn how to route information, but the FFN layers store what the model knows. For knowledge-intensive tasks (QA, factual generation), adapting FFN layers is important.
- The cost is still tiny. Even with LoRA on all matrices, trainable parameters are well under 1% of total.
- Rank can be lower. With more matrices adapted, each individual rank can be smaller while achieving the same total expressiveness.
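With PEFT, targeting every linear layer is a one-line choice. A sketch (the `"all-linear"` shortcut requires a reasonably recent PEFT version):

```python
from peft import LoraConfig

# Adapt every linear layer (attention projections + FFN weights),
# following current practice rather than Q/V-only.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",   # all attention + FFN linear layers
    task_type="CAUSAL_LM",
)
```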
Rank Selection: How to Choose r
The rank $r$ is LoRA's most important hyperparameter. Too low and the model can't learn the task. Too high and you waste memory without benefit.
The surprising effectiveness of low ranks
One of the most striking findings in the LoRA paper is how effective very low ranks are. On GPT-3 175B:
| Rank | Trainable Params | WikiSQL Acc | MultiNLI Acc |
|---|---|---|---|
| 1 | 0.3M | 73.2 | 91.2 |
| 2 | 0.6M | 73.6 | 91.6 |
| 4 | 1.2M | 73.9 | 91.5 |
| 8 | 2.4M | 74.0 | 91.7 |
| 64 | 18.9M | 73.7 | 91.6 |
Rank 4 is essentially as good as rank 64. Rank 1 is within 1% of the best! This strongly supports the low-rank hypothesis: the meaningful fine-tuning update really does live in a very low-dimensional subspace.
Performance even decreases slightly at rank 64. This is likely overfitting - with more parameters, the model starts memorizing training examples rather than learning the general pattern. LoRA's rank constraint acts as an implicit regularizer (similar to what we discussed in the regularization post).
Practical guidelines
Based on the literature and practitioner experience:
| Rank | Use case |
|---|---|
| $r = 1{-}4$ | Simple classification, style transfer |
| $r = 8$ | General instruction tuning, most NLP tasks |
| $r = 16$ | The safe default; works well almost everywhere |
| $r = 32$ | Complex reasoning, math, coding tasks |
| $r = 64$ | When you have abundant data and compute; diminishing returns |
| $r \ge 128$ | Rarely needed; consider full fine-tuning if this seems necessary |
The heuristic: start with $r = 16$. If your task is simple and your data is small, try $r = 4$ or $r = 8$. If performance isn't sufficient, increase rank before trying other interventions.
The rank-alpha relationship
The scaling factor $\frac{\alpha}{r}$ means that changing rank changes the effective learning rate of the LoRA parameters. Common conventions:
- $\alpha = r$: Effective scaling is 1.0. The LoRA update magnitude is independent of rank.
- $\alpha = 2r$: The default in many implementations. Slightly amplifies the LoRA update.
- Fixed $\alpha = 16$ or $\alpha = 32$: Used when the practitioner wants to sweep rank without changing the effective scale.

When sweeping rank, keep $\frac{\alpha}{r}$ constant (or equivalently, adjust the learning rate proportionally). Otherwise, you're confounding rank with learning rate.
QLoRA: Quantization Meets LoRA
LoRA reduces the trainable parameter count and the optimizer memory. But the frozen base model still needs to be loaded into GPU memory. A 65B model in fp16 is 130 GB - far beyond any single GPU.
QLoRA (Dettmers et al., 2023) solves this by quantizing the frozen base model to 4 bits while keeping LoRA adapters in fp16/bf16. Three innovations make this work:
1. 4-bit NormalFloat (NF4) quantization
Standard 4-bit integer quantization is too crude for neural network weights, which follow an approximately normal (Gaussian) distribution. NF4 is an information-theoretically optimal data type for normally-distributed data.
The idea: map the 16 possible 4-bit values to the quantiles of a standard normal distribution. This ensures equal numbers of weights map to each quantization bucket, minimizing information loss.
$$q_i = \frac{1}{2}\left( \Phi^{-1}\!\left(\frac{i}{2^4 + 1}\right) + \Phi^{-1}\!\left(\frac{i + 1}{2^4 + 1}\right) \right)$$

where $\Phi^{-1}$ is the inverse normal CDF (the quantile function). The quantization levels, rescaled to $[-1, 1]$, are approximately:

$$\{-1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0, 0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0\}$$

Each weight is normalized by the block's absmax value, mapped to the nearest NF4 level, and stored as a 4-bit index. This achieves much better precision than naive int4 quantization.
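Here is a simplified sketch of the quantile idea using only the standard library. The real NF4 table is constructed slightly differently so that it contains an exact zero; this version just places levels at evenly spaced normal quantiles:

```python
from statistics import NormalDist

# Simplified NF4-style construction: 16 levels at evenly spaced
# quantiles of a standard normal, rescaled so the extremes sit at +/-1.
nd = NormalDist()
k = 4                                            # 4-bit -> 16 levels
probs = [(i + 1) / (2**k + 1) for i in range(2**k)]
raw = [nd.inv_cdf(p) for p in probs]
scale = max(abs(v) for v in raw)
levels = [v / scale for v in raw]                # rescale to [-1, 1]

def quantize(w, levels):
    """Map a normalized weight in [-1, 1] to the nearest level."""
    return min(levels, key=lambda q: abs(q - w))

print(levels[0], levels[-1])                     # endpoints at -1 and +1
```

More weights fall near zero than near the extremes, so this spacing puts roughly equal numbers of weights in each bucket.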
2. Double quantization
Each block of weights (typically 64 values) requires a scaling constant (absmax value) stored in fp32 - that's 32 bits of overhead per 64 weights, or 0.5 bits per parameter.
Double quantization quantizes these scaling constants themselves to 8-bit floats, using a second level of block-wise quantization (blocks of 256 scaling constants). This reduces the overhead from 0.5 bits to ~0.127 bits per parameter. Not a huge savings, but it adds up at scale.
3. Paged optimizers
When GPU memory is almost full, occasional spikes (from long sequences or large batches) can cause OOM errors. QLoRA uses NVIDIA's unified memory to automatically page optimizer states to CPU RAM when GPU memory is exhausted, and page them back when needed. This acts as a safety net against OOM crashes.
Memory comparison
Let's do the full accounting for a 65B model:
| Component | Full FT (fp16) | LoRA (fp16) | QLoRA (NF4+fp16) |
|---|---|---|---|
| Base model | 130 GB | 130 GB | 33 GB (4-bit) |
| LoRA adapters | - | 0.08 GB | 0.08 GB |
| Gradients | 130 GB | 0.08 GB | 0.08 GB |
| Optimizer states | 520 GB | 0.32 GB | 0.32 GB |
| Total | 780 GB | 130.5 GB | 33.5 GB |
QLoRA fits on a single 48GB GPU. Full fine-tuning needs a cluster of 10+ A100-80GB GPUs.
And the performance? Dettmers et al. showed that QLoRA matches 16-bit LoRA on essentially every benchmark. The 4-bit quantization of the frozen weights introduces negligible error because the LoRA adapters learn to compensate.
QLoRA in practice
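A typical QLoRA setup with transformers, bitsandbytes, and PEFT looks roughly like this. This is a sketch: the model name is illustrative and details vary by library version.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 with double quantization,
# then attach bf16 LoRA adapters on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()        # well under 1% trainable
```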
Other Parameter-Efficient Methods
LoRA isn't the only PEFT method. Let's survey the landscape and see how alternatives compare.
Prefix Tuning (Li & Liang, 2021)
Idea: prepend learnable "virtual tokens" to the key and value sequences in every attention layer. These prefix vectors are the only trainable parameters.
For a model with $L$ layers and prefix length $p$:

$$\text{trainable params} = 2 \times L \times p \times d$$

The factor of 2 is for keys and values. With $L = 32$, $p = 20$, $d = 4096$:

$$2 \times 32 \times 20 \times 4096 = 5{,}242{,}880 \approx 5.2\text{M}$$

Pros: Elegant formulation; no modification to model weights. Cons: Reduces effective context length by $p$ tokens; training can be unstable; inference has non-trivial overhead (extra KV entries at every layer).
Adapters (Houlsby et al., 2019)
Idea: insert small bottleneck layers (down-project, nonlinearity, up-project) between existing transformer sub-layers.
Each adapter has roughly $2 d r_{\text{adapter}}$ parameters (two projections plus biases). With bottleneck $r_{\text{adapter}} = 64$ and adapters after both attention and FFN in all 32 layers:

$$2 \times 32 \times (2 \times 4096 \times 64) \approx 33.6\text{M}$$
Pros: Well-studied; works reliably across tasks. Cons: Adds sequential computation to the forward pass (adapters can't be parallelized with the main path). This increases inference latency.
IA3 (Liu et al., 2022)
Idea: learn scaling vectors that rescale the keys, values, and FFN intermediate activations. No new weight matrices - just element-wise multiplication.
$$K' = l_k \odot K, \quad V' = l_v \odot V, \quad h' = l_{\text{ff}} \odot h$$

where $l_k, l_v \in \mathbb{R}^{d}$ and $l_{\text{ff}} \in \mathbb{R}^{d_{\text{ff}}}$.

Params per layer: $2d + d_{\text{ff}} = 2 \times 4096 + 16{,}384 = 24{,}576$. Total for 32 layers: $32 \times 24{,}576 \approx 0.8\text{M}$.
Pros: Extremely few parameters; zero inference overhead (scaling can be fused into weights). Cons: Less expressive than LoRA; struggles on tasks requiring significant adaptation.
Comparison
| Method | Trainable Params (7B model) | Inference Overhead | Training Stability | Performance |
|---|---|---|---|---|
| Full Fine-Tuning | 7B (100%) | None | Good | Best |
| LoRA (r=16) | ~17M (0.24%) | None (merge) | Good | Near-best |
| QLoRA (r=16) | ~17M (0.24%) | None (merge) | Good | Near-best |
| Prefix Tuning (p=20) | ~5M (0.07%) | Moderate | Unstable | Good |
| Adapters (r=64) | ~34M (0.49%) | Sequential | Good | Good |
| IA3 | ~0.8M (0.01%) | None | Good | Moderate |
LoRA's unique advantage: zero inference overhead. Because $\frac{\alpha}{r} BA$ can be merged into a single matrix $W' = W + \frac{\alpha}{r} BA$ at deployment time, there's no extra computation during inference. Adapters add sequential bottleneck layers. Prefix tuning adds extra KV entries. LoRA adds nothing.
This is why LoRA dominates in practice.
Merging LoRA Weights
The merge operation
One of LoRA's most elegant properties: at inference time, you can compute $W' = W + \frac{\alpha}{r} BA$ once and replace the original weight. The model is now identical to one that was fully fine-tuned (at the rank-$r$ approximation level), with zero runtime overhead.
With the PEFT library:
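A sketch of the merge with PEFT (model name and adapter path are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, attach a trained adapter, then fold the
# low-rank update into the base weights for zero-overhead inference.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()          # computes W + (alpha/r) * BA
merged.save_pretrained("path/to/merged-model")
```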
Multiple LoRA adapters: model switching
Because LoRA adapters are small (~50-100 MB for a 7B model), you can train many adapters for different tasks and swap them at runtime:
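A sketch of runtime adapter switching with PEFT (adapter paths and names are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# One base model in GPU memory, several small task adapters on disk.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "adapters/support-tickets",
                                  adapter_name="support")
model.load_adapter("adapters/sql-generation", adapter_name="sql")

model.set_adapter("support")   # route requests through the support adapter
# ... generate ...
model.set_adapter("sql")       # switch tasks without reloading the base
```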
This is how companies serve hundreds of specialized models from a single base model in production. The base model stays in GPU memory (expensive), while LoRA adapters are swapped in and out (cheap - just loading a few MB of weights).
LoRA adapter arithmetic
Because LoRA adapters are linear updates ($\Delta W = BA$), they support linear arithmetic:

$$W' = W + \lambda_1 B_1 A_1 + \lambda_2 B_2 A_2 + \cdots$$
This enables:
- Task interpolation: blend a code adapter and a math adapter to get a model good at both
- Task negation: subtract an adapter's effect ($\lambda < 0$) to remove a capability
- Progressive merging: gradually increase $\lambda$ from 0 to 1 to smoothly transition between behaviors
This linearity doesn't hold for full fine-tuning (nonlinear optimization landscape), making LoRA uniquely composable.
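A toy numpy illustration of this weight-space arithmetic (shapes are illustrative; real adapters also carry the $\alpha/r$ scale):

```python
import numpy as np

# Two hypothetical adapters ("code" and "math") on the same base weight.
rng = np.random.default_rng(0)
d, k, r = 32, 32, 4
W = rng.standard_normal((d, k))                                     # frozen base
B1, A1 = rng.standard_normal((d, r)), rng.standard_normal((r, k))   # adapter 1
B2, A2 = rng.standard_normal((d, r)), rng.standard_normal((r, k))   # adapter 2

W_blend = W + 0.5 * (B1 @ A1) + 0.5 * (B2 @ A2)   # interpolation
W_negate = W - 1.0 * (B1 @ A1)                    # remove adapter 1's effect

# The blend is exactly the average of the two single-adapter models:
W_a1 = W + B1 @ A1
W_a2 = W + B2 @ A2
print(np.allclose(W_blend, 0.5 * (W_a1 + W_a2)))  # True
```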
Advanced LoRA Variants
The success of LoRA has spawned numerous variants. Here are the most important ones:
DoRA (Weight-Decomposed Low-Rank Adaptation)
DoRA (Liu et al., 2024) decomposes the weight update into magnitude and direction components:

$$W' = m \cdot \frac{W + BA}{\lVert W + BA \rVert_c}$$

where $m$ is a learnable magnitude vector and $\lVert \cdot \rVert_c$ denotes column-wise normalization. This is inspired by weight normalization and consistently outperforms standard LoRA by 1-2% across benchmarks, with minimal additional cost (just one extra vector $m$).
AdaLoRA (Adaptive LoRA)
Standard LoRA uses the same rank $r$ for every weight matrix. AdaLoRA (Zhang et al., 2023) dynamically allocates rank across layers and matrices based on importance scores, parameterizing the update in SVD-like form:

$$\Delta W = P \Lambda Q$$

where $\Lambda$ is diagonal and its entries are pruned according to an importance metric.
Layers with higher importance (measured by gradient-based sensitivity) get more rank. Empirically, AdaLoRA concentrates rank in the lower and upper layers (which tend to be more task-specific), while middle layers get lower rank.
LoRA+ (Different Learning Rates for A and B)
Hayou et al. (2024) showed that using different learning rates for $A$ and $B$ improves convergence. Specifically, setting $\eta_B = \lambda \eta_A$ with $\lambda \gg 1$ consistently outperforms standard LoRA with a single learning rate. The intuition: $A$ handles the down-projection and benefits from a smaller learning rate for stability, while $B$ handles the up-projection and can learn faster.
Practical Guide: Training with LoRA
Let's put it all together with a complete training example. We'll fine-tune a language model for instruction following using QLoRA.
Setup
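A sketch of the setup step (model and dataset names are illustrative; substitute your own):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA has no pad token by default

# Load the base model quantized to 4-bit NF4 (the QLoRA recipe).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
dataset = load_dataset("tatsu-lab/alpaca", split="train")
```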
Configure LoRA
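A sketch of the LoRA configuration, continuing from the setup step (module names follow LLaMA conventions and vary by architecture):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)   # `model` from the setup step
lora_config = LoraConfig(
    r=16,                      # the safe-default rank
    lora_alpha=32,             # alpha = 2r convention
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # prints the trainable fraction
```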
Training loop
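A sketch of the training loop using the HuggingFace Trainer, continuing from the steps above (hyperparameters are reasonable starting points, not tuned values):

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="lora-out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size 16
    learning_rate=2e-4,                # higher than full fine-tuning
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
    optim="paged_adamw_8bit",          # paged 8-bit AdamW (QLoRA recipe)
    gradient_checkpointing=True,       # trade compute for activation memory
)

def tokenize(example):
    # Assumes the dataset has a "text" column, as in the setup sketch.
    return tokenizer(example["text"], truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset.map(tokenize, remove_columns=dataset.column_names),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```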
Key hyperparameter notes
A few details that matter in practice:
Learning rate: LoRA benefits from a higher learning rate than full fine-tuning. Typical values are $1 \times 10^{-4}$ to $3 \times 10^{-4}$, compared to $1 \times 10^{-5}$ to $5 \times 10^{-5}$ for full fine-tuning. The reason: the LoRA update starts at zero (because $B$ is zero-initialized), so the adapter parameters need larger updates to move away from the initialization.
Dropout: LoRA-specific dropout (lora_dropout) is applied to the input of the LoRA branch (before the down-projection $A$), so it regularizes only the adapter path. Values of 0.05-0.1 work well. This is separate from any dropout in the base model.
Gradient checkpointing: Always enable this for QLoRA training. It recomputes activations during backprop instead of storing them, trading ~20% speed for ~60% memory savings on activations.
Optimizer: paged_adamw_8bit combines the QLoRA paged optimizer (OOM safety net) with 8-bit Adam (further memory savings on optimizer states). The 8-bit quantization of optimizer states has been shown to cause no degradation in practice.
Save and merge
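Putting the last step into code (a sketch; paths are placeholders, and `model`/`tokenizer` come from the steps above):

```python
# Save just the adapter - a few tens of MB containing A, B, and config.
model.save_pretrained("out/lora-adapter")

# Optionally fold (alpha/r) * BA into the base weights for deployment.
merged = model.merge_and_unload()
merged.save_pretrained("out/merged-model")       # standalone, no PEFT needed
tokenizer.save_pretrained("out/merged-model")
```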
Debugging LoRA Training
Common issues and how to fix them:
Loss doesn't decrease
- Check rank: Try increasing to $r = 32$ or $r = 64$
- Check target modules: Are you applying LoRA to the right layers? Use `model.print_trainable_parameters()` to verify
- Check learning rate: LoRA needs higher LR than full FT (try $2 \times 10^{-4}$)
- Check alpha: If $\alpha$ is too small, LoRA updates are suppressed
Loss decreases but eval quality is poor
- Overfitting: Lower rank, increase dropout, or add more data
- Data quality: Garbage in, garbage out. Check training examples
- Evaluation mismatch: Ensure eval format matches training format
Out of memory
- Enable gradient checkpointing: `model.gradient_checkpointing_enable()`
- Reduce batch size: Use gradient accumulation to maintain effective batch size
- Switch to QLoRA: 4-bit base model + fp16 LoRA
- Reduce sequence length: Memory scales linearly with sequence length
Performance gap with full fine-tuning
This is rare with proper hyperparameters, but if it happens:
- Increase rank: Try $r = 64$ or $r = 128$
- Apply to more layers: Include FFN weights, not just attention
- More epochs: LoRA sometimes needs more epochs to converge
- Try DoRA: The magnitude-direction decomposition often closes the gap
When to Use What
A decision framework:
Full fine-tuning when:
- You have abundant compute (multi-GPU cluster)
- Your task requires fundamentally different behavior from the base model
- Maximum performance matters more than efficiency
LoRA when:
- Single GPU or limited multi-GPU setup
- Your task is an adaptation of the base model's capabilities
- You need multiple task-specific models from one base
- You want zero inference overhead
QLoRA when:
- Single consumer GPU (24-48 GB)
- Fine-tuning models larger than your GPU can hold in fp16
- Development/experimentation (fast iteration)
IA3 / Prefix Tuning when:
- Extremely limited parameter budget
- Very simple tasks (classification, style transfer)
- Few-shot scenarios where even LoRA might overfit
The Bigger Picture
LoRA changed how we think about model adaptation. Before LoRA, fine-tuning was an all-or-nothing affair: you either updated every parameter or used prompting tricks. LoRA showed that the parameter space of fine-tuning is much smaller than the model itself - you can navigate it with a tiny steering wheel.
The implications extend beyond efficiency:
- Democratization: Anyone with a consumer GPU can fine-tune state-of-the-art models. The 65B parameter barrier became a 48GB barrier, and the 7B barrier became a 6GB barrier with QLoRA.
- Multi-tenant serving: One base model, many LoRA adapters. Companies like Predibase and Modal deploy hundreds of specialized models from a single base, swapping adapters at request time.
- Composition: LoRA adapters can be added, subtracted, and interpolated. This opens up model merging, task arithmetic, and continual learning without catastrophic forgetting.
- Scientific insight: The success of low-rank adaptation tells us something deep about the geometry of neural network loss landscapes. Fine-tuning doesn't explore the full parameter space - it follows low-dimensional trajectories. Understanding why this works is an active area of research.
The trajectory from full fine-tuning to LoRA to QLoRA mirrors a broader pattern in deep learning: the most impactful ideas are often the simplest. LoRA is just matrix factorization applied to weight updates. QLoRA is just quantization applied to the frozen weights. The genius is in recognizing where simplicity suffices.
Next in the series: we'll explore mixture of experts (MoE) - how sparse models like Mixtral use conditional computation to scale to hundreds of billions of parameters while only activating a fraction at inference time.
