Apr 3, 2026

LoRA & Parameter-Efficient Fine-Tuning - Adapting Giants on a Budget

How LoRA decomposes weight updates into low-rank matrices to fine-tune billion-parameter models on a single GPU. Interactive visualizations of matrix decomposition, memory savings, and the math behind efficient adaptation.


Fine-tuning is how we take a general-purpose language model and make it yours - trained on your data, for your task. But there's a problem. A big one.

A 7-billion parameter model in fp16 takes 14 GB just to load the weights. Full fine-tuning requires storing gradients (another 14 GB) and optimizer states (Adam keeps two fp32 copies per parameter - that's 56 GB). Add it up: fine-tuning a 7B model needs roughly 84 GB of GPU memory. That's more than a single A100. For a 70B model? You need a cluster.

This is absurd. We want to teach the model a new task - maybe classify customer support tickets, or write code in a particular style. Do we really need to update all 7 billion parameters?

The answer, as it turns out, is no. Not even close.

The Fine-Tuning Dilemma

Let's break down where the memory goes during full fine-tuning. For a model with $N$ parameters:

| Component | Memory (fp16/fp32) | 7B Model |
| --- | --- | --- |
| Model weights | $2N$ bytes (fp16) | 14 GB |
| Gradients | $2N$ bytes (fp16) | 14 GB |
| Adam optimizer (momentum) | $4N$ bytes (fp32) | 28 GB |
| Adam optimizer (variance) | $4N$ bytes (fp32) | 28 GB |
| **Total** | **$12N$ bytes** | **84 GB** |

The optimizer states alone consume 4x the model size. This is because Adam (and AdamW, which we covered in the optimizers post) maintains two running averages per parameter in full precision (fp32).

And this is just the static memory. During the forward pass, you also store activations for backpropagation - that's additional memory proportional to batch size and sequence length.
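The static-memory accounting above is easy to sanity-check in code. This is a rough sketch (the function name is mine, and it ignores activations and framework overhead; it uses $10^9$ bytes per GB to match the round numbers in the table):

```python
def full_ft_memory_gb(n_params: float) -> dict:
    """Static memory for full fine-tuning with Adam: fp16 weights and
    gradients, plus two fp32 optimizer moments per parameter."""
    weights = 2 * n_params    # fp16 weights: 2 bytes each
    grads = 2 * n_params      # fp16 gradients
    optimizer = 8 * n_params  # two fp32 moments (momentum + variance): 4 + 4 bytes
    gb = 1e9
    return {
        "weights": weights / gb,
        "grads": grads / gb,
        "optimizer": optimizer / gb,
        "total": (weights + grads + optimizer) / gb,
    }

print(full_ft_memory_gb(7e9))
# {'weights': 14.0, 'grads': 14.0, 'optimizer': 56.0, 'total': 84.0}
```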

The chart above tells the story. Select different model sizes and watch the bars. Full fine-tuning scales linearly and brutally. A 65B model needs over 780 GB in total - that's 10 A100-80GB GPUs, with optimizer states alone accounting for 520 GB.

But look at LoRA and QLoRA. The bars barely grow. A 65B model with QLoRA fits on a single 48GB GPU. How?

The Low-Rank Hypothesis

What is matrix rank?

The rank of a matrix is the number of linearly independent rows (or equivalently, columns). It tells you the "true dimensionality" of the information the matrix encodes.

A $d \times d$ matrix has at most rank $d$. But many real-world matrices have much lower effective rank - most of their information concentrates in a few dimensions, with the rest being noise or redundancy.

Consider the Singular Value Decomposition (SVD). Any matrix $M$ can be decomposed as:

$$M = U \Sigma V^T$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d \geq 0$.

What does each piece mean? Think of any matrix as a transformation that acts on vectors. SVD breaks that transformation into three simpler steps:

  1. $V^T$ (Rotate): First, rotate the input into a new coordinate system that aligns with the "natural axes" of the transformation — the directions along which the matrix acts most cleanly.
  2. $\Sigma$ (Scale): Then, stretch or shrink along each of those axes. The singular values $\sigma_1, \sigma_2, \ldots$ are exactly these stretch factors, sorted from largest to smallest.
  3. $U$ (Rotate again): Finally, rotate the result into the output coordinate system.

The singular values tell you how much "energy" each dimension carries. A large $\sigma_i$ means that dimension contributes a lot to the matrix's overall effect; a tiny $\sigma_i$ means that dimension is almost negligible.

A concrete example. Take this small $3 \times 3$ matrix:

$$M = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0.01 \end{bmatrix}$$

This is already diagonal, so the SVD is trivial: $U = V = I$ (the identity matrix) and $\Sigma = M$ itself. The singular values are $\sigma_1 = 3$, $\sigma_2 = 2$, $\sigma_3 = 0.01$. Notice that the third singular value is tiny — it contributes almost nothing. If we set it to zero, we get a rank-2 approximation that is nearly identical to the original:

$$M \approx \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

We just went from 9 parameters to describing the matrix with only 2 meaningful values — and barely lost any information.

Of course, real matrices aren't diagonal. But that's exactly what SVD does: it finds the right rotations ($U$ and $V$) so that after rotating, the matrix becomes diagonal ($\Sigma$). Then you can look at the diagonal entries and decide which ones actually matter.

Why does this help? If the singular values drop off quickly — say $\sigma_1 = 100, \sigma_2 = 50, \sigma_3 = 0.1, \sigma_4 = 0.01, \ldots$ — then the matrix is well-approximated by keeping only the top $r$ components:

$$M \approx U_r \Sigma_r V_r^T$$

Here $U_r$ is the first $r$ columns of $U$, $\Sigma_r$ is the $r \times r$ top-left block of $\Sigma$, and $V_r^T$ is the first $r$ rows of $V^T$. By the Eckart–Young theorem, this is the best possible rank-$r$ approximation to $M$ (in the sense of minimizing the Frobenius norm of the error).

This rank-$r$ approximation uses $r(d + d + 1) = r(2d+1)$ parameters instead of $d^2$. For $d = 4096$ and $r = 16$, that's $131{,}088$ instead of $16{,}777{,}216$ - a 128x reduction.

The interactive visualization below lets you see this in action. Drag the slider to change the rank and watch how the approximation converges to the original:

Notice something remarkable: even at very low ranks (5-10), the approximation captures the main structure. The cross pattern, the circular ring, the smooth gradients - all the meaningful information lives in a low-dimensional subspace. The high-rank components are just noise.

Aghajanyan et al.: Intrinsic Dimensionality

In 2021, Aghajanyan, Gupta, and Zettlemoyer published a paper called "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." Their key finding:

Pre-trained language models have a very low intrinsic dimensionality for downstream tasks.

What does this mean? They projected gradient updates into random low-dimensional subspaces and found that fine-tuning worked almost as well. A RoBERTa model with 355 million parameters could be fine-tuned effectively in a subspace of dimension 200 - that's 0.00006% of the full parameter space.

This makes intuitive sense. Pre-training already teaches the model a rich representation of language. Fine-tuning for a specific task is a small adjustment to that representation - moving through a low-dimensional manifold in parameter space, not exploring the full $N$-dimensional landscape.

The weight change $\Delta W = W_{\text{fine-tuned}} - W_{\text{pre-trained}}$ has low intrinsic rank. This is the key insight that makes LoRA possible.

LoRA: Low-Rank Adaptation

The core idea

LoRA (Hu et al., 2021) makes the low-rank hypothesis explicit and practical. Instead of updating a weight matrix $W \in \mathbb{R}^{d \times d}$ directly, we decompose the update into two smaller matrices:

$$W' = W + \Delta W = W + BA$$

where:

  • $B \in \mathbb{R}^{d \times r}$ (the "up-projection")
  • $A \in \mathbb{R}^{r \times d}$ (the "down-projection")
  • $r \ll d$ is the rank

The original weight $W$ stays frozen - we never compute gradients for it, never store optimizer states for it. Only $B$ and $A$ are trainable.

Drag the rank slider above. At $r = 8$ with $d = 64$, we need $64 \times 8 + 8 \times 64 = 1{,}024$ trainable parameters instead of $64 \times 64 = 4{,}096$. That's a 4x reduction in this toy example. At real model scales ($d = 4096$, $r = 16$), the ratio is 128x.

Parameter savings math

For a single weight matrix $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$:

| Method | Trainable params |
| --- | --- |
| Full fine-tuning | $d_{\text{out}} \times d_{\text{in}}$ |
| LoRA (rank $r$) | $d_{\text{out}} \times r + r \times d_{\text{in}}$ |
| Savings | $\frac{d_{\text{out}} \times d_{\text{in}}}{r(d_{\text{out}} + d_{\text{in}})}$ times fewer |

For a square matrix ($d_{\text{out}} = d_{\text{in}} = d$), the savings simplify to $\frac{d}{2r}$. With $d = 4096$ and $r = 16$:

$$\text{Savings} = \frac{4096}{2 \times 16} = 128\times$$

In a transformer model, LoRA is typically applied to the attention projection matrices (Q, K, V, O). A 7B model like LLaMA-2-7B has 32 transformer layers, each with 4 projection matrices of size $4096 \times 4096$. With rank $r = 16$:

$$\text{LoRA params} = 32 \times 4 \times 2 \times 4096 \times 16 = 16{,}777{,}216 \approx 16.8\text{M}$$

That's 16.8 million trainable parameters out of 7 billion total - roughly 0.24% of the model. And yet this achieves performance competitive with full fine-tuning on many tasks.
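That count is easy to reproduce in a few lines. The helper function here is purely illustrative - each adapted matrix of shape $(d_{\text{out}}, d_{\text{in}})$ contributes $r(d_{\text{out}} + d_{\text{in}})$ trainable parameters:

```python
def lora_params(layers: int, shapes: list[tuple[int, int]], r: int) -> int:
    """Trainable LoRA parameters: r * (d_out + d_in) per adapted matrix."""
    return layers * sum(r * (d_out + d_in) for d_out, d_in in shapes)

# LLaMA-2-7B attention: 32 layers, four 4096x4096 projections, rank 16
attn_shapes = [(4096, 4096)] * 4
print(f"{lora_params(32, attn_shapes, r=16):,}")  # 16,777,216
```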

The forward pass

During the forward pass, both paths run in parallel:

$$h = Wx + \frac{\alpha}{r} BAx$$

The input $x$ flows through two paths:

  1. Frozen path: $Wx$ - the original pretrained computation
  2. LoRA path: $BAx$ - the learned adaptation, scaled by $\alpha/r$

The results are summed to produce the output $h$.

The scaling factor $\alpha/r$ is important. It controls the magnitude of the LoRA update relative to the original weights. The parameter $\alpha$ is a hyperparameter (typically set to 16 or 32). Dividing by $r$ means that when you increase the rank, each individual component contributes less - the total update magnitude stays approximately constant across different rank choices. This makes a good value of $\alpha$ roughly transferable across choices of $r$.

Initialization

LoRA uses a specific initialization scheme that ensures the model starts exactly where pre-training left off:

  • $A$ is initialized with random Gaussian values (small, like Kaiming init)
  • $B$ is initialized to all zeros

This means $BA = 0$ at the start of training. The model's output is initially identical to the pre-trained model. Training then gradually learns the low-rank update $\Delta W = BA$.

This is crucial for stability. You're not perturbing the pretrained model at initialization - you're starting from exactly the pretrained weights and smoothly moving toward the fine-tuned solution.

```python
import torch
import torch.nn as nn
import math

class LoRALinear(nn.Module):
    """A linear layer with LoRA adaptation."""

    def __init__(self, in_features: int, out_features: int,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank

        # Frozen original weight
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )

        # LoRA matrices
        self.lora_A = nn.Parameter(
            torch.randn(rank, in_features) * (1 / math.sqrt(in_features))
        )
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init!

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path + LoRA path
        h = x @ self.weight.T                                        # standard linear
        h = h + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling   # LoRA update
        return h
```

Let's verify the zero-initialization property:

```python
layer = LoRALinear(in_features=512, out_features=512, rank=8)
x = torch.randn(1, 512)

# At init, LoRA contributes nothing
lora_out = (x @ layer.lora_A.T @ layer.lora_B.T) * layer.scaling
print(f"LoRA output at init: {lora_out.abs().max().item():.6f}")  # 0.000000

# Full output is just Wx
full_out = layer(x)
frozen_out = x @ layer.weight.T
print(f"Max difference: {(full_out - frozen_out).abs().max().item():.6f}")  # 0.000000
```

Where to Apply LoRA

Transformer weight matrices

A standard transformer layer has several weight matrices. Which ones should get LoRA adapters?

Attention projections:

  • $W_Q$ (query projection): $d_{\text{model}} \to d_{\text{model}}$
  • $W_K$ (key projection): $d_{\text{model}} \to d_{\text{model}}$
  • $W_V$ (value projection): $d_{\text{model}} \to d_{\text{model}}$
  • $W_O$ (output projection): $d_{\text{model}} \to d_{\text{model}}$

Feed-forward network (FFN):

  • $W_{\text{up}}$ (up-projection): $d_{\text{model}} \to d_{\text{ff}}$ (typically $d_{\text{ff}} = 4 \times d_{\text{model}}$)
  • $W_{\text{down}}$ (down-projection): $d_{\text{ff}} \to d_{\text{model}}$
  • $W_{\text{gate}}$ (gate projection, in gated FFNs like LLaMA): $d_{\text{model}} \to d_{\text{ff}}$

What the original paper found

Hu et al. (2021) experimented with different combinations on GPT-3 175B. Their findings:

| LoRA applied to | Params | WikiSQL Acc | MultiNLI Acc |
| --- | --- | --- | --- |
| $W_Q$ only | 4.7M | 73.4 | 91.7 |
| $W_K$ only | 4.7M | 73.2 | 91.3 |
| $W_V$ only | 4.7M | 73.8 | 91.7 |
| $W_Q, W_V$ | 9.4M | 74.4 | 91.7 |
| $W_Q, W_K, W_V, W_O$ | 18.9M | 74.6 | 91.8 |

Key takeaways:

  1. $W_Q$ and $W_V$ together give most of the benefit. Adding $W_K$ and $W_O$ helps only marginally.
  2. More matrices with lower rank can outperform fewer matrices with higher rank, given the same total parameter budget.
  3. Even a single matrix ($W_Q$ or $W_V$) gets remarkably close to the best result.

Modern practice

In current practice (2024-2026), most practitioners apply LoRA to all linear layers - attention projections plus FFN weights. The HuggingFace PEFT library makes this trivial. Here's why the trend shifted:

  1. FFN weights store factual knowledge. The attention layers learn how to route information, but the FFN layers store what the model knows. For knowledge-intensive tasks (QA, factual generation), adapting FFN layers is important.
  2. The cost is still tiny. Even with LoRA on all matrices, trainable parameters are well under 1% of total.
  3. Rank can be lower. With more matrices adapted, each individual rank can be smaller while achieving the same total expressiveness.
```python
# Modern practice: LoRA on all linear layers
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # FFN (LLaMA-style)
    ],
    lora_dropout=0.05,
    bias="none",
)
```

Rank Selection: How to Choose r

The rank $r$ is LoRA's most important hyperparameter. Too low and the model can't learn the task. Too high and you waste memory without benefit.

The surprising effectiveness of low ranks

One of the most striking findings in the LoRA paper is how effective very low ranks are. On GPT-3 175B:

| Rank $r$ | Trainable Params | WikiSQL Acc | MultiNLI Acc |
| --- | --- | --- | --- |
| 1 | 0.3M | 73.2 | 91.2 |
| 2 | 0.6M | 73.6 | 91.6 |
| 4 | 1.2M | 73.9 | 91.5 |
| 8 | 2.4M | 74.0 | 91.7 |
| 64 | 18.9M | 73.7 | 91.6 |

Rank 4 is essentially as good as rank 64. Rank 1 is within 1% of the best! This strongly supports the low-rank hypothesis: the meaningful fine-tuning update really does live in a very low-dimensional subspace.

Performance even decreases slightly at rank 64. This is likely overfitting - with more parameters, the model starts memorizing training examples rather than learning the general pattern. LoRA's rank constraint acts as an implicit regularizer (similar to what we discussed in the regularization post).

Practical guidelines

Based on the literature and practitioner experience:

| Rank | Use case |
| --- | --- |
| $r = 4$ | Simple classification, style transfer |
| $r = 8$ | General instruction tuning, most NLP tasks |
| $r = 16$ | The safe default; works well almost everywhere |
| $r = 32$ | Complex reasoning, math, coding tasks |
| $r = 64$ | When you have abundant data and compute; diminishing returns |
| $r = 128+$ | Rarely needed; consider full fine-tuning if this seems necessary |

The heuristic: start with $r = 16$. If your task is simple and your data is small, try $r = 4$ or $r = 8$. If performance isn't sufficient, increase rank before trying other interventions.

The rank-alpha relationship

The scaling factor $\alpha/r$ means that changing rank changes the effective learning rate of the LoRA parameters. Common conventions:

  • $\alpha = r$: Effective scaling is 1.0. The LoRA update magnitude is independent of rank.
  • $\alpha = 2r$: The default in many implementations. Slightly amplifies the LoRA update.
  • Fixed $\alpha = 16$ or $32$: A common default across tasks; note that sweeping rank under a fixed $\alpha$ changes the effective scale $\alpha/r$.

When sweeping rank, keep $\alpha/r$ constant (or equivalently, adjust the learning rate proportionally). Otherwise, you're confounding rank with learning rate.

```python
# Rank sweep with constant effective scaling
configs = [
    LoraConfig(r=4, lora_alpha=8, ...),    # alpha/r = 2
    LoraConfig(r=8, lora_alpha=16, ...),   # alpha/r = 2
    LoraConfig(r=16, lora_alpha=32, ...),  # alpha/r = 2
    LoraConfig(r=32, lora_alpha=64, ...),  # alpha/r = 2
]
```

QLoRA: Quantization Meets LoRA

LoRA reduces the trainable parameter count and the optimizer memory. But the frozen base model still needs to be loaded into GPU memory. A 65B model in fp16 is 130 GB - far beyond any single GPU.

QLoRA (Dettmers et al., 2023) solves this by quantizing the frozen base model to 4 bits while keeping LoRA adapters in fp16/bf16. Three innovations make this work:

1. 4-bit NormalFloat (NF4) quantization

Standard 4-bit integer quantization is too crude for neural network weights, which follow an approximately normal (Gaussian) distribution. NF4 is an information-theoretically optimal data type for normally-distributed data.

The idea: map the 16 possible 4-bit values to the quantiles of a standard normal distribution. This ensures equal numbers of weights map to each quantization bucket, minimizing information loss.

$$q_i = \Phi^{-1}\left(\frac{i + 0.5}{16}\right), \quad i = 0, 1, \ldots, 15$$

where $\Phi^{-1}$ is the inverse normal CDF (the quantile function). After normalizing to the range $[-1, 1]$, the quantization levels are:

NF4 levels: [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

Each weight is normalized by the block's absmax value, mapped to the nearest NF4 level, and stored as a 4-bit index. This achieves much better precision than naive int4 quantization.
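Here's a simplified numpy sketch of that procedure, using the levels listed above. This is just the logic, not the real implementation: actual bitsandbytes kernels pack two 4-bit indices per byte and run fused on the GPU, and the helper names here are mine:

```python
import numpy as np

# The 16 NF4 levels from the text (normalized to [-1, 1])
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911,
                0.0, 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])

def nf4_quantize(w: np.ndarray, block: int = 64):
    """Block-wise absmax quantization to 4-bit NF4 indices (sketch)."""
    w = w.reshape(-1, block)
    absmax = np.abs(w).max(axis=1, keepdims=True)  # one scale per block
    normalized = w / absmax                        # now in [-1, 1]
    idx = np.abs(normalized[:, :, None] - NF4).argmin(axis=2)  # nearest level
    return idx.astype(np.uint8), absmax

def nf4_dequantize(idx: np.ndarray, absmax: np.ndarray) -> np.ndarray:
    return NF4[idx] * absmax

# Gaussian weights quantize well because the levels match normal quantiles
rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
idx, absmax = nf4_quantize(w)
w_hat = nf4_dequantize(idx, absmax).reshape(-1)

rel_rmse = np.sqrt(np.mean((w - w_hat) ** 2)) / w.std()
print(f"relative RMS error: {rel_rmse:.3f}")
```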

2. Double quantization

Each block of weights (typically 64 values) requires a scaling constant (absmax value) stored in fp32 - that's 32 bits of overhead per 64 weights, or 0.5 bits per parameter.

Double quantization quantizes these scaling constants themselves to 8-bit floats, using a second level of block-wise quantization (blocks of 256 scaling constants). This reduces the overhead from 0.5 bits to ~0.127 bits per parameter. Not a huge savings, but it adds up at scale.
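The arithmetic behind those per-parameter overhead figures, spelled out:

```python
# Overhead of quantization constants per weight parameter
block1, block2 = 64, 256  # weights per block; scaling constants per second-level block

plain = 32 / block1                           # fp32 absmax per 64 weights: 0.5 bits
double = 8 / block1 + 32 / (block1 * block2)  # 8-bit constants + fp32 second-level scale

print(f"{plain} bits -> {double:.3f} bits per parameter")
# 0.5 bits -> 0.127 bits per parameter
```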

3. Paged optimizers

When GPU memory is almost full, occasional spikes (from long sequences or large batches) can cause OOM errors. QLoRA uses NVIDIA's unified memory to automatically page optimizer states to CPU RAM when GPU memory is exhausted, and page them back when needed. This acts as a safety net against OOM crashes.

Memory comparison

Let's do the full accounting for a 65B model:

| Component | Full FT (fp16) | LoRA (fp16) | QLoRA (NF4+fp16) |
| --- | --- | --- | --- |
| Base model | 130 GB | 130 GB | 33 GB (4-bit) |
| LoRA adapters | - | 0.08 GB | 0.08 GB |
| Gradients | 130 GB | 0.08 GB | 0.08 GB |
| Optimizer states | 520 GB | 0.32 GB | 0.32 GB |
| **Total** | **780 GB** | **130.5 GB** | **33.5 GB** |

QLoRA fits on a single 48GB GPU. Full fine-tuning needs a cluster of 10+ A100-80GB GPUs.

And the performance? Dettmers et al. showed that QLoRA matches 16-bit LoRA on essentially every benchmark. The 4-bit quantization of the frozen weights introduces negligible error because the LoRA adapters learn to compensate.

QLoRA in practice

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,          # double quantization
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training (handles gradient checkpointing, etc.)
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()
# trainable params: 83,886,080 || all params: 68,977,704,960 || trainable%: 0.1216%
```

Other Parameter-Efficient Methods

LoRA isn't the only PEFT method. Let's survey the landscape and see how alternatives compare.

Prefix Tuning (Li & Liang, 2021)

Idea: prepend learnable "virtual tokens" to the key and value sequences in every attention layer. These prefix vectors are the only trainable parameters.

For a model with $L$ layers and prefix length $p$:

$$\text{Trainable params} = 2 \times L \times p \times d_{\text{model}}$$

The factor of 2 is for keys and values. With $L = 32$, $p = 20$, $d = 4096$:

$$\text{Params} = 2 \times 32 \times 20 \times 4096 = 5{,}242{,}880 \approx 5.2\text{M}$$

Pros: Elegant formulation; no modification to model weights. Cons: Reduces effective context length by $p$ tokens; training can be unstable; inference has non-trivial overhead (extra KV entries at every layer).

Adapters (Houlsby et al., 2019)

Idea: insert small bottleneck layers (down-project, nonlinearity, up-project) between existing transformer sub-layers.

x → LayerNorm → Attention → x + Adapter(Attention_out) → LayerNorm → FFN → x + Adapter(FFN_out)

Each adapter has $2 \times d \times r + r$ parameters (two projections plus a bias). With adapters after both attention and FFN in all 32 layers:

$$\text{Params} = 32 \times 2 \times (2 \times 4096 \times 64 + 64) \approx 33.6\text{M}$$

Pros: Well-studied; works reliably across tasks. Cons: Adds sequential computation to the forward pass (adapters can't be parallelized with the main path). This increases inference latency.

IA3 (Liu et al., 2022)

Idea: learn scaling vectors that rescale the keys, values, and FFN intermediate activations. No new weight matrices - just element-wise multiplication.

$$K' = l_K \odot K, \quad V' = l_V \odot V, \quad \text{FFN}_{\text{out}}' = l_{\text{ff}} \odot \text{FFN}_{\text{out}}$$

where $l_K, l_V \in \mathbb{R}^{d_{\text{model}}}$ and $l_{\text{ff}} \in \mathbb{R}^{d_{\text{ff}}}$.

Params per layer: $2d_{\text{model}} + d_{\text{ff}} = 2(4096) + 16384 = 24{,}576$. Total for 32 layers: $32 \times 24{,}576 = 786{,}432 \approx 0.8\text{M}$.

Pros: Extremely few parameters; zero inference overhead (scaling can be fused into weights). Cons: Less expressive than LoRA; struggles on tasks requiring significant adaptation.
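The parameter counts for all three methods can be reproduced under the same assumptions used in the text ($L = 32$, $d_{\text{model}} = 4096$, $d_{\text{ff}} = 16384$, prefix length 20, adapter bottleneck 64):

```python
L, d_model, d_ff = 32, 4096, 16384

prefix_tuning = 2 * L * 20 * d_model        # keys + values, prefix length p = 20
adapters = L * 2 * (2 * d_model * 64 + 64)  # bottleneck r = 64, after attention and FFN
ia3 = L * (2 * d_model + d_ff)              # one scaling vector each for K, V, FFN

print(f"prefix: {prefix_tuning:,}  adapters: {adapters:,}  ia3: {ia3:,}")
# prefix: 5,242,880  adapters: 33,558,528  ia3: 786,432
```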

Comparison

| Method | Trainable Params (7B model) | Inference Overhead | Training Stability | Performance |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | 7B (100%) | None | Good | Best |
| LoRA (r=16) | ~17M (0.24%) | None (merge) | Good | Near-best |
| QLoRA (r=16) | ~17M (0.24%) | None (merge) | Good | Near-best |
| Prefix Tuning (p=20) | ~5M (0.07%) | Moderate | Unstable | Good |
| Adapters (r=64) | ~34M (0.49%) | Sequential | Good | Good |
| IA3 | ~0.8M (0.01%) | None | Good | Moderate |

LoRA's unique advantage: zero inference overhead. Because $W + BA$ can be merged into a single matrix at deployment time, there's no extra computation during inference. Adapters add sequential bottleneck layers. Prefix tuning adds extra KV entries. LoRA adds nothing.

This is why LoRA dominates in practice.

Merging LoRA Weights

The merge operation

One of LoRA's most elegant properties: at inference time, you can compute $W' = W + BA$ once and replace the original weight. The model is now identical to one that was fully fine-tuned (at the rank-$r$ approximation level), with zero runtime overhead.

```python
# During training: two separate paths
def forward_training(x, W, B, A, scaling):
    return x @ W.T + (x @ A.T @ B.T) * scaling

# At inference: merge and use a single matrix
W_merged = W + scaling * (B @ A)

def forward_inference(x, W_merged):
    return x @ W_merged.T  # identical result, no overhead
```

With the PEFT library:

```python
# Merge LoRA weights into the base model
model = model.merge_and_unload()

# Now the model is a standard model with no LoRA adapters
# Inference is exactly as fast as the original model
model.save_pretrained("./merged_model")
```

Multiple LoRA adapters: model switching

Because LoRA adapters are small (~50-100 MB for a 7B model), you can train many adapters for different tasks and swap them at runtime:

```python
from peft import PeftModel

# Load base model once
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load different LoRA adapters for different tasks
model = PeftModel.from_pretrained(base_model, "customer-support-adapter")
# ... serve customer support requests ...

# Switch to a different adapter
model.load_adapter("code-generation-adapter")
# ... serve code generation requests ...

# Or even combine adapters (experimental)
model.add_weighted_adapter(
    adapters=["customer-support-adapter", "code-generation-adapter"],
    weights=[0.7, 0.3],
    adapter_name="hybrid",
)
```

This is how companies serve hundreds of specialized models from a single base model in production. The base model stays in GPU memory (expensive), while LoRA adapters are swapped in and out (cheap - just loading a few MB of weights).

LoRA adapter arithmetic

Because LoRA adapters are linear updates ($\Delta W = BA$), they support linear arithmetic:

$$W_{\text{combined}} = W + \lambda_1 B_1 A_1 + \lambda_2 B_2 A_2$$

This enables:

  • Task interpolation: blend a code adapter and a math adapter to get a model good at both
  • Task negation: subtract an adapter's effect ($\lambda < 0$) to remove a capability
  • Progressive merging: gradually increase $\lambda$ from 0 to 1 to smoothly transition between behaviors

This linearity doesn't hold for full fine-tuning (nonlinear optimization landscape), making LoRA uniquely composable.
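A toy demonstration of this composability: merge two (random, purely illustrative) adapters with weights $\lambda_1 = 0.7$, $\lambda_2 = 0.3$, and check that the merged matrix gives the same output as running both LoRA paths separately:

```python
import torch

d, r = 64, 8
W = torch.randn(d, d)
B1, A1 = torch.randn(d, r), torch.randn(r, d)  # "adapter 1" (random, toy)
B2, A2 = torch.randn(d, r), torch.randn(r, d)  # "adapter 2" (random, toy)

# Weighted merge of two adapters into the base weight
lam1, lam2 = 0.7, 0.3
W_combined = W + lam1 * (B1 @ A1) + lam2 * (B2 @ A2)

# Applying the merged matrix equals summing the individual LoRA paths
x = torch.randn(1, d)
separate = x @ W.T + lam1 * (x @ A1.T @ B1.T) + lam2 * (x @ A2.T @ B2.T)
print(torch.allclose(x @ W_combined.T, separate, atol=1e-3))  # True
```

The equality is exact in exact arithmetic (it's just distributivity of matrix multiplication); the `atol` only absorbs float32 rounding.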

Advanced LoRA Variants

The success of LoRA has spawned numerous variants. Here are the most important ones:

DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA (Liu et al., 2024) decomposes the weight update into magnitude and direction components:

$$W' = m \cdot \frac{W + BA}{\|W + BA\|_c}$$

where $m$ is a learnable magnitude vector and $\|\cdot\|_c$ denotes column-wise normalization. This is inspired by weight normalization and consistently outperforms standard LoRA by 1-2% across benchmarks, with minimal additional cost (just one extra vector $m$).

AdaLoRA (Adaptive LoRA)

Standard LoRA uses the same rank $r$ for every weight matrix. AdaLoRA (Zhang et al., 2023) dynamically allocates rank across layers and matrices based on importance scores:

$$r_i = r_{\text{budget}} \cdot \frac{s_i}{\sum_j s_j}$$

Layers with higher importance (measured by gradient-based sensitivity) get more rank. Empirically, AdaLoRA concentrates rank in the lower and upper layers (which tend to be more task-specific), while middle layers get lower rank.

LoRA+ (Different Learning Rates for A and B)

Hayou et al. (2024) showed that using different learning rates for $A$ and $B$ improves convergence. Specifically, setting $\eta_B = \lambda \cdot \eta_A$ with $\lambda \approx 16$ consistently outperforms standard LoRA with a single learning rate. The intuition: $A$ handles the down-projection and benefits from a smaller learning rate for stability, while $B$ handles the up-projection and can learn faster.

Practical Guide: Training with LoRA

Let's put it all together with a complete training example. We'll fine-tune a language model for instruction following using QLoRA.

Setup

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset

# Model and quantization
model_name = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Prepare for training
model = prepare_model_for_kbit_training(model)
```

Configure LoRA

```python
lora_config = LoraConfig(
    r=16,               # rank
    lora_alpha=32,      # scaling (alpha/r = 2)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,  # regularization
    bias="none",        # don't train biases
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 6,751,809,536 || trainable%: 0.2019%
```

Training loop

```python
# Load instruction-following dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Format into chat template
def format_instruction(example):
    if example["input"]:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}

dataset = dataset.map(format_instruction)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-llama2-7b",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 16
    learning_rate=2e-4,              # higher LR for LoRA than full FT
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,     # saves memory at cost of ~20% speed
    optim="paged_adamw_8bit",        # QLoRA paged optimizer
    max_grad_norm=0.3,
)

# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=1024,
    dataset_text_field="text",
)
trainer.train()
```

Key hyperparameter notes

A few details that matter in practice:

Learning rate: LoRA benefits from a higher learning rate than full fine-tuning. Typical values are $2 \times 10^{-4}$ to $3 \times 10^{-4}$, compared to $1 \times 10^{-5}$ to $5 \times 10^{-5}$ for full fine-tuning. The reason: the LoRA update starts at zero (because of the $B = 0$ initialization), so the parameters need larger steps to move away from that starting point.

Dropout: LoRA-specific dropout (lora_dropout) is applied to the input of the LoRA path, before the $A$ projection. Values of 0.05-0.1 work well. This is separate from any dropout in the base model.

Gradient checkpointing: Always enable this for QLoRA training. It recomputes activations during backprop instead of storing them, trading ~20% speed for ~60% memory savings on activations.

Optimizer: paged_adamw_8bit combines the QLoRA paged optimizer (OOM safety net) with 8-bit Adam (further memory savings on optimizer states). The 8-bit quantization of optimizer states has been shown to cause no degradation in practice.

Save and merge

```python
# Save LoRA adapter (small - ~50MB)
model.save_pretrained("./lora-adapter")

# Later: merge for deployment
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "./lora-adapter")
merged = merged.merge_and_unload()
merged.save_pretrained("./merged-model")
# This is now a standard model with no LoRA overhead
```

Debugging LoRA Training

Common issues and how to fix them:

Loss doesn't decrease

  • Check rank: Try increasing to $r = 32$ or $r = 64$
  • Check target modules: Are you applying LoRA to the right layers? Use model.print_trainable_parameters() to verify
  • Check learning rate: LoRA needs a higher LR than full FT (try $2 \times 10^{-4}$)
  • Check alpha: If $\alpha/r$ is too small, LoRA updates are suppressed

Loss decreases but eval quality is poor

  • Overfitting: Lower rank, increase dropout, or add more data
  • Data quality: Garbage in, garbage out. Check training examples
  • Evaluation mismatch: Ensure eval format matches training format

Out of memory

  • Enable gradient checkpointing: model.gradient_checkpointing_enable()
  • Reduce batch size: Use gradient accumulation to maintain effective batch size
  • Switch to QLoRA: 4-bit base model + fp16 LoRA
  • Reduce sequence length: Memory scales linearly with sequence length

Performance gap with full fine-tuning

This is rare with proper hyperparameters, but if it happens:

  • Increase rank: Try $r = 64$ or $r = 128$
  • Apply to more layers: Include FFN weights, not just attention
  • More epochs: LoRA sometimes needs more epochs to converge
  • Try DoRA: The magnitude-direction decomposition often closes the gap

When to Use What

A decision framework:

Full fine-tuning when:

  • You have abundant compute (multi-GPU cluster)
  • Your task requires fundamentally different behavior from the base model
  • Maximum performance matters more than efficiency

LoRA when:

  • Single GPU or limited multi-GPU setup
  • Your task is an adaptation of the base model's capabilities
  • You need multiple task-specific models from one base
  • You want zero inference overhead

QLoRA when:

  • Single consumer GPU (24-48 GB)
  • Fine-tuning models larger than your GPU can hold in fp16
  • Development/experimentation (fast iteration)

IA3 / Prefix Tuning when:

  • Extremely limited parameter budget
  • Very simple tasks (classification, style transfer)
  • Few-shot scenarios where even LoRA might overfit

The Bigger Picture

LoRA changed how we think about model adaptation. Before LoRA, fine-tuning was an all-or-nothing affair: you either updated every parameter or used prompting tricks. LoRA showed that the parameter space of fine-tuning is much smaller than the model itself - you can navigate it with a tiny steering wheel.

The implications extend beyond efficiency:

  1. Democratization: Anyone with a consumer GPU can fine-tune state-of-the-art models. The 65B parameter barrier became a 48GB barrier, and the 7B barrier became a 6GB barrier with QLoRA.

  2. Multi-tenant serving: One base model, many LoRA adapters. Companies like Predibase and Modal deploy hundreds of specialized models from a single base, swapping adapters at request time.

  3. Composition: LoRA adapters can be added, subtracted, and interpolated. This opens up model merging, task arithmetic, and continual learning without catastrophic forgetting.

  4. Scientific insight: The success of low-rank adaptation tells us something deep about the geometry of neural network loss landscapes. Fine-tuning doesn't explore the full parameter space - it follows low-dimensional trajectories. Understanding why this works is an active area of research.

The trajectory from full fine-tuning to LoRA to QLoRA mirrors a broader pattern in deep learning: the most impactful ideas are often the simplest. LoRA is just matrix factorization applied to weight updates. QLoRA is just quantization applied to the frozen weights. The genius is in recognizing where simplicity suffices.


Next in the series: we'll explore mixture of experts (MoE) - how sparse models like Mixtral use conditional computation to scale to hundreds of billions of parameters while only activating a fraction at inference time.