Fine-tuning is how we take a general-purpose language model and make it yours - trained on your data, for your task. But there's a problem. A big one.
A 7-billion parameter model in fp16 takes 14 GB just to load the weights. Full fine-tuning requires storing gradients (another 14 GB) and optimizer states (Adam keeps two fp32 copies per parameter - that's 56 GB). Add it up: fine-tuning a 7B model needs roughly 84 GB of GPU memory. That's more than a single A100. For a 70B model? You need a cluster.
This is absurd. We want to teach the model a new task - maybe classify customer support tickets, or write code in a particular style. Do we really need to update all 7 billion parameters?
The answer, as it turns out, is no. Not even close.
The Fine-Tuning Dilemma
Let's break down where the memory goes during full fine-tuning. For a model with $N$ parameters:

| Component | Memory (fp16/fp32) | 7B Model |
|---|---|---|
| Model weights | $2N$ bytes (fp16) | 14 GB |
| Gradients | $2N$ bytes (fp16) | 14 GB |
| Adam optimizer (momentum) | $4N$ bytes (fp32) | 28 GB |
| Adam optimizer (variance) | $4N$ bytes (fp32) | 28 GB |
| Total | $12N$ bytes | 84 GB |
The optimizer states alone consume 4x the model size. This is because Adam (and AdamW, which we covered in the optimizers post) maintains two running averages per parameter in full precision (fp32).
And this is just the static memory. During the forward pass, you also store activations for backpropagation - that's additional memory proportional to batch size and sequence length.
The chart above tells the story. Select different model sizes and watch the bars. Full fine-tuning scales linearly and brutally. A 65B model needs over 780 GB - roughly ten A100-80GB GPUs, with 520 GB going to optimizer states alone.
But look at LoRA and QLoRA. The bars barely grow. A 65B model with QLoRA fits on a single 48GB GPU. How?
The Low-Rank Hypothesis
What is matrix rank?
The rank of a matrix is the number of linearly independent rows (or equivalently, columns). It tells you the "true dimensionality" of the information the matrix encodes.
An $m \times n$ matrix has at most rank $\min(m, n)$. But many real-world matrices have much lower effective rank - most of their information concentrates in a few dimensions, with the rest being noise or redundancy.
Consider the Singular Value Decomposition (SVD). Any matrix $W \in \mathbb{R}^{m \times n}$ can be decomposed as:

$$W = U \Sigma V^\top$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$.
What does each piece mean? Think of any matrix as a transformation that acts on vectors. SVD breaks that transformation into three simpler steps:
- $V^\top$ (Rotate): First, rotate the input into a new coordinate system that aligns with the "natural axes" of the transformation — the directions along which the matrix acts most cleanly.
- $\Sigma$ (Scale): Then, stretch or shrink along each of those axes. The singular values $\sigma_i$ are exactly these stretch factors, sorted from largest to smallest.
- $U$ (Rotate again): Finally, rotate the result into the output coordinate system.

The singular values tell you how much "energy" each dimension carries. A large $\sigma_i$ means that dimension contributes a lot to the matrix's overall effect; a tiny $\sigma_i$ means that dimension is almost negligible.
A concrete example. Take this small diagonal matrix:

$$W = \begin{pmatrix} 9 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 0.1 \end{pmatrix}$$

This is already diagonal, so the SVD is trivial: $U = V = I$ (the identity matrix) and $\Sigma = W$ itself. The singular values are $\sigma_1 = 9$, $\sigma_2 = 4$, $\sigma_3 = 0.1$. Notice that the third singular value is tiny — it contributes almost nothing. If we set it to zero, we get a rank-2 approximation that is nearly identical to the original:

$$W_2 = \begin{pmatrix} 9 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

We just went from 9 parameters to describing the matrix with only 2 meaningful values — and barely lost any information.
Of course, real matrices aren't diagonal. But that's exactly what SVD does: it finds the right rotations ($U$ and $V$) so that after rotating, the matrix becomes diagonal ($\Sigma$). Then you can look at the diagonal entries and decide which ones actually matter.
Why does this help? If the singular values drop off quickly — say $\sigma_1 \gg \sigma_{k+1}$ for some small $k$ — then the matrix is well-approximated by keeping only the top $k$ components:

$$W \approx W_k = U_k \Sigma_k V_k^\top$$

Here $U_k$ is the first $k$ columns of $U$, $\Sigma_k$ is the top-left $k \times k$ block of $\Sigma$, and $V_k^\top$ is the first $k$ rows of $V^\top$. This is known as the Eckart–Young theorem: this rank-$k$ matrix is the best possible rank-$k$ approximation to $W$ (in terms of minimizing the Frobenius norm of the error).

This rank-$k$ approximation uses $k(m + n)$ parameters instead of $mn$. For $m = n = 1024$ and $k = 4$, that's $8{,}192$ instead of $1{,}048{,}576$ - a 128x reduction.
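Here's a small numpy experiment that makes this concrete. It is a sketch with a synthetic matrix whose spectrum decays rapidly; truncating the SVD at rank 4 reconstructs the matrix almost perfectly while storing 128x fewer numbers:

```python
import numpy as np

# Build a 1024x1024 matrix with a rapidly decaying spectrum, then
# check how well the rank-4 truncated SVD reconstructs it.
rng = np.random.default_rng(0)
m = n = 1024
U, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthogonal U
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal V
sigma = 10.0 ** -np.arange(n, dtype=float)         # sigma_i = 10^-i
W = U @ np.diag(sigma) @ V.T

k = 4
Wk = U[:, :k] @ np.diag(sigma[:k]) @ V[:, :k].T    # rank-k approximation

rel_err = np.linalg.norm(W - Wk) / np.linalg.norm(W)
params_full = m * n                                # 1,048,576
params_lowrank = k * (m + n)                       # 8,192
print(rel_err, params_full // params_lowrank)
```

With this spectrum the relative error is on the order of $10^{-4}$ while the parameter count drops by a factor of 128.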
The interactive visualization below lets you see this in action. Drag the slider to change the rank and watch how the approximation converges to the original:
Notice something remarkable: even at very low ranks (5-10), the approximation captures the main structure. The cross pattern, the circular ring, the smooth gradients - all the meaningful information lives in a low-dimensional subspace. The high-rank components are just noise.
Aghajanyan et al.: Intrinsic Dimensionality
In 2021, Aghajanyan, Gupta, and Zettlemoyer published a paper called "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." Their key finding:
Pre-trained language models have a very low intrinsic dimensionality for downstream tasks.
What does this mean? They projected gradient updates into random low-dimensional subspaces and found that fine-tuning worked almost as well. A RoBERTa model with 355 million parameters could be fine-tuned effectively in a subspace of dimension 200 - that's 0.00006% of the full parameter space.
This makes intuitive sense. Pre-training already teaches the model a rich representation of language. Fine-tuning for a specific task is a small adjustment to that representation - moving through a low-dimensional manifold in parameter space, not exploring the full $N$-dimensional parameter landscape.
The weight change $\Delta W = W_{\text{finetuned}} - W_{\text{pretrained}}$ has low intrinsic rank. This is the key insight that makes LoRA possible.
LoRA: Low-Rank Adaptation
The core idea
LoRA (Hu et al., 2021) makes the low-rank hypothesis explicit and practical. Instead of updating a weight matrix $W \in \mathbb{R}^{d \times k}$ directly, we decompose the update into two smaller matrices:

$$W' = W + \Delta W = W + BA$$

where:
- $B \in \mathbb{R}^{d \times r}$ (the "up-projection")
- $A \in \mathbb{R}^{r \times k}$ (the "down-projection")
- $r \ll \min(d, k)$ is the rank

The original weight $W$ stays frozen - we never compute gradients for it, never store optimizer states for it. Only $A$ and $B$ are trainable.
Drag the rank slider above. At rank $r$, we need $r(d + k)$ trainable parameters instead of $dk$ - a 4x reduction in this toy example. At real model scales ($d = k = 4096$, $r = 8$), the ratio is 256x.
Parameter savings math
For a single weight matrix $W \in \mathbb{R}^{d \times k}$:

| Method | Trainable params |
|---|---|
| Full fine-tuning | $dk$ |
| LoRA (rank $r$) | $r(d + k)$ |
| Savings | $\frac{dk}{r(d + k)}$ times fewer |

For a square matrix ($d = k = n$), the savings simplify to $\frac{n}{2r}$. With $n = 4096$ and $r = 16$:

$$\frac{n}{2r} = \frac{4096}{2 \times 16} = 128\times \text{ fewer parameters}$$
In a transformer model, LoRA is typically applied to the attention projection matrices (Q, K, V, O). A 7B model like LLaMA-2-7B has 32 transformer layers, each with 4 projection matrices of size $4096 \times 4096$. With rank $r = 16$:

$$32 \times 4 \times 16 \times (4096 + 4096) = 16{,}777{,}216 \approx 16.8\text{M}$$
That's 16.8 million trainable parameters out of 7 billion total - roughly 0.24% of the model. And yet this achieves performance competitive with full fine-tuning on many tasks.
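You can verify the arithmetic directly:

```python
# Trainable-parameter count for LoRA (r=16) on the attention
# projections of a LLaMA-2-7B-shaped model.
n_layers = 32        # transformer layers
n_proj = 4           # Q, K, V, O projections per layer
d = 4096             # hidden dimension (each projection is d x d)
r = 16               # LoRA rank

lora_params = n_layers * n_proj * r * (d + d)   # r(d + k) per matrix
total_params = 7_000_000_000
print(lora_params, lora_params / total_params)  # ~16.8M, ~0.24%
```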
The forward pass
During the forward pass, both paths run in parallel:

$$h = Wx + \frac{\alpha}{r} BAx$$

The input $x$ flows through two paths:
- Frozen path: $Wx$ - the original pretrained computation
- LoRA path: $BAx$ - the learned adaptation, scaled by $\frac{\alpha}{r}$

The results are summed to produce the output $h$.
The scaling factor $\frac{\alpha}{r}$ is important. It controls the magnitude of the LoRA update relative to the original weights. The parameter $\alpha$ is a hyperparameter (typically set to 16 or 32). Dividing by $r$ means that when you increase the rank, each individual component contributes less - the total update magnitude stays approximately constant across different rank choices. This makes the update scale roughly invariant to the choice of $r$.
Initialization
LoRA uses a specific initialization scheme that ensures the model starts exactly where pre-training left off:
- $A$ is initialized with random Gaussian values (small, like Kaiming init)
- $B$ is initialized to all zeros

This means $\Delta W = BA = 0$ at the start of training. The model's output is initially identical to the pre-trained model. Training then gradually learns the low-rank update $BA$.
This is crucial for stability. You're not perturbing the pretrained model at initialization - you're starting from exactly the pretrained weights and smoothly moving toward the fine-tuned solution.
Let's verify the zero-initialization property:
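A minimal numpy sketch (not the PEFT implementation): with $B$ initialized to zero, the adapted layer's output exactly matches the frozen layer's output.

```python
import numpy as np

# LoRA initialization: A ~ small Gaussian, B = 0, so the update BA
# starts at exactly zero and the adapted layer matches the base layer.
rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d, k))            # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01     # down-projection: random init
B = np.zeros((d, r))                       # up-projection: zero init

x = rng.standard_normal(k)
h_base = W @ x
h_lora = W @ x + (alpha / r) * (B @ (A @ x))
print(np.allclose(h_base, h_lora))         # True: identical at init
```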
Where to Apply LoRA
Transformer weight matrices
A standard transformer layer has several weight matrices. Which ones should get LoRA adapters?
Attention projections:
- $W_Q$ (query projection): $d \times d$
- $W_K$ (key projection): $d \times d$
- $W_V$ (value projection): $d \times d$
- $W_O$ (output projection): $d \times d$

Feed-forward network (FFN):
- $W_{\text{up}}$ (up-projection): $d \times d_{\text{ff}}$ (typically $d_{\text{ff}} = 4d$)
- $W_{\text{down}}$ (down-projection): $d_{\text{ff}} \times d$
- $W_{\text{gate}}$ (gate projection, in gated FFNs like LLaMA): $d \times d_{\text{ff}}$
What the original paper found
Hu et al. (2021) experimented with different combinations on GPT-3 175B. Their findings:
| LoRA applied to | Params | WikiSQL Acc | MultiNLI Acc |
|---|---|---|---|
| $W_Q$ only | 4.7M | 73.4 | 91.7 |
| $W_K$ only | 4.7M | 73.2 | 91.3 |
| $W_V$ only | 4.7M | 73.8 | 91.7 |
| $W_Q, W_V$ | 9.4M | 74.4 | 91.7 |
| $W_Q, W_K, W_V, W_O$ | 18.9M | 74.6 | 91.8 |
Key takeaways:
- $W_Q$ and $W_V$ together give most of the benefit. Adding $W_K$ and $W_O$ helps only marginally.
- More matrices with lower rank can outperform fewer matrices with higher rank, given the same total parameter budget.
- Even a single matrix ($W_Q$ or $W_V$) gets remarkably close to the best result.
Modern practice
In current practice (2024-2026), most practitioners apply LoRA to all linear layers - attention projections plus FFN weights. The HuggingFace PEFT library makes this trivial. Here's why the trend shifted:
- FFN weights store factual knowledge. The attention layers learn how to route information, but the FFN layers store what the model knows. For knowledge-intensive tasks (QA, factual generation), adapting FFN layers is important.
- The cost is still tiny. Even with LoRA on all matrices, trainable parameters are well under 1% of total.
- Rank can be lower. With more matrices adapted, each individual rank can be smaller while achieving the same total expressiveness.
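With PEFT, targeting every linear layer is a one-line choice. A sketch (the `"all-linear"` shortcut requires a reasonably recent PEFT version):

```python
from peft import LoraConfig

# Adapt every linear layer (attention projections + FFN weights),
# following current practice rather than Q/V-only.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",   # all attention + FFN linear layers
    task_type="CAUSAL_LM",
)
```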
Rank Selection: How to Choose r
The rank $r$ is LoRA's most important hyperparameter. Too low and the model can't learn the task. Too high and you waste memory without benefit.
The surprising effectiveness of low ranks
One of the most striking findings in the LoRA paper is how effective very low ranks are. On GPT-3 175B:
| Rank | Trainable Params | WikiSQL Acc | MultiNLI Acc |
|---|---|---|---|
| 1 | 0.3M | 73.2 | 91.2 |
| 2 | 0.6M | 73.6 | 91.6 |
| 4 | 1.2M | 73.9 | 91.5 |
| 8 | 2.4M | 74.0 | 91.7 |
| 64 | 18.9M | 73.7 | 91.6 |
Rank 4 is essentially as good as rank 64. Rank 1 is within 1% of the best! This strongly supports the low-rank hypothesis: the meaningful fine-tuning update really does live in a very low-dimensional subspace.
Performance even decreases slightly at rank 64. This is likely overfitting - with more parameters, the model starts memorizing training examples rather than learning the general pattern. LoRA's rank constraint acts as an implicit regularizer (similar to what we discussed in the regularization post).
Practical guidelines
Based on the literature and practitioner experience:
| Rank | Use case |
|---|---|
| $r = 1{-}4$ | Simple classification, style transfer |
| $r = 8$ | General instruction tuning, most NLP tasks |
| $r = 16$ | The safe default; works well almost everywhere |
| $r = 32$ | Complex reasoning, math, coding tasks |
| $r = 64$ | When you have abundant data and compute; diminishing returns |
| $r \ge 128$ | Rarely needed; consider full fine-tuning if this seems necessary |
The heuristic: start with $r = 16$. If your task is simple and your data is small, try $r = 4$ or $r = 8$. If performance isn't sufficient, increase rank before trying other interventions.
The rank-alpha relationship
The scaling factor $\frac{\alpha}{r}$ means that changing rank changes the effective learning rate of the LoRA parameters. Common conventions:
- $\alpha = r$: Effective scaling is 1.0. The LoRA update magnitude is independent of rank.
- $\alpha = 2r$: The default in many implementations. Slightly amplifies the LoRA update.
- Fixed $\alpha = 16$ or $\alpha = 32$: Used when the practitioner wants to sweep rank without changing the effective scale.

When sweeping rank, keep $\frac{\alpha}{r}$ constant (or equivalently, adjust the learning rate proportionally). Otherwise, you're confounding rank with learning rate.
QLoRA: Quantization Meets LoRA
LoRA reduces the trainable parameter count and the optimizer memory. But the frozen base model still needs to be loaded into GPU memory. A 65B model in fp16 is 130 GB - far beyond any single GPU.
QLoRA (Dettmers et al., 2023) solves this by quantizing the frozen base model to 4 bits while keeping LoRA adapters in fp16/bf16. Three innovations make this work:
1. 4-bit NormalFloat (NF4) quantization
Standard 4-bit integer quantization is too crude for neural network weights, which follow an approximately normal (Gaussian) distribution. NF4 is an information-theoretically optimal data type for normally-distributed data.
The idea: map the 16 possible 4-bit values to the quantiles of a standard normal distribution. This ensures equal numbers of weights map to each quantization bucket, minimizing information loss.
$$q_i = \frac{1}{2}\left( \Phi^{-1}\!\left(\frac{i}{2^4 + 1}\right) + \Phi^{-1}\!\left(\frac{i + 1}{2^4 + 1}\right) \right)$$

where $\Phi^{-1}$ is the inverse normal CDF (the quantile function). The quantization levels, rescaled to $[-1, 1]$, are approximately:

$$\{-1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0, 0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0\}$$

Each weight is normalized by the block's absmax value, mapped to the nearest NF4 level, and stored as a 4-bit index. This achieves much better precision than naive int4 quantization.
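Here is a simplified sketch of the quantile idea using only the standard library. The real NF4 table is constructed slightly differently so that it contains an exact zero; this version just places levels at evenly spaced normal quantiles:

```python
from statistics import NormalDist

# Simplified NF4-style construction: 16 levels at evenly spaced
# quantiles of a standard normal, rescaled so the extremes sit at +/-1.
nd = NormalDist()
k = 4                                            # 4-bit -> 16 levels
probs = [(i + 1) / (2**k + 1) for i in range(2**k)]
raw = [nd.inv_cdf(p) for p in probs]
scale = max(abs(v) for v in raw)
levels = [v / scale for v in raw]                # rescale to [-1, 1]

def quantize(w, levels):
    """Map a normalized weight in [-1, 1] to the nearest level."""
    return min(levels, key=lambda q: abs(q - w))

print(levels[0], levels[-1])                     # endpoints at -1 and +1
```

More weights fall near zero than near the extremes, so this spacing puts roughly equal numbers of weights in each bucket.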
2. Double quantization
Each block of weights (typically 64 values) requires a scaling constant (absmax value) stored in fp32 - that's 32 bits of overhead per 64 weights, or 0.5 bits per parameter.
Double quantization quantizes these scaling constants themselves to 8-bit floats, using a second level of block-wise quantization (blocks of 256 scaling constants). This reduces the overhead from 0.5 bits to ~0.127 bits per parameter. Not a huge savings, but it adds up at scale.
3. Paged optimizers
When GPU memory is almost full, occasional spikes (from long sequences or large batches) can cause OOM errors. QLoRA uses NVIDIA's unified memory to automatically page optimizer states to CPU RAM when GPU memory is exhausted, and page them back when needed. This acts as a safety net against OOM crashes.
Memory comparison
Let's do the full accounting for a 65B model:
| Component | Full FT (fp16) | LoRA (fp16) | QLoRA (NF4+fp16) |
|---|---|---|---|
| Base model | 130 GB | 130 GB | 33 GB (4-bit) |
| LoRA adapters | - | 0.08 GB | 0.08 GB |
| Gradients | 130 GB | 0.08 GB | 0.08 GB |
| Optimizer states | 520 GB | 0.32 GB | 0.32 GB |
| Total | 780 GB | 130.5 GB | 33.5 GB |
QLoRA fits on a single 48GB GPU. Full fine-tuning needs a cluster of 10+ A100-80GB GPUs.
And the performance? Dettmers et al. showed that QLoRA matches 16-bit LoRA on essentially every benchmark. The 4-bit quantization of the frozen weights introduces negligible error because the LoRA adapters learn to compensate.
QLoRA in practice
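A typical QLoRA setup with transformers, bitsandbytes, and PEFT looks roughly like this. This is a sketch: the model name is illustrative and details vary by library version.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 with double quantization,
# then attach bf16 LoRA adapters on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()        # well under 1% trainable
```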
Other Parameter-Efficient Methods
LoRA isn't the only PEFT method. Let's survey the landscape and see how alternatives compare.
Prefix Tuning (Li & Liang, 2021)
Idea: prepend learnable "virtual tokens" to the key and value sequences in every attention layer. These prefix vectors are the only trainable parameters.
For a model with $L$ layers and prefix length $p$:

$$\text{trainable params} = 2 \times L \times p \times d$$

The factor of 2 is for keys and values. With $L = 32$, $p = 20$, $d = 4096$:

$$2 \times 32 \times 20 \times 4096 = 5{,}242{,}880 \approx 5.2\text{M}$$

Pros: Elegant formulation; no modification to model weights. Cons: Reduces effective context length by $p$ tokens; training can be unstable; inference has non-trivial overhead (extra KV entries at every layer).
Adapters (Houlsby et al., 2019)
Idea: insert small bottleneck layers (down-project, nonlinearity, up-project) between existing transformer sub-layers.
Each adapter has roughly $2 d r_{\text{adapter}}$ parameters (two projections plus biases). With bottleneck $r_{\text{adapter}} = 64$ and adapters after both attention and FFN in all 32 layers:

$$2 \times 32 \times (2 \times 4096 \times 64) \approx 33.6\text{M}$$
Pros: Well-studied; works reliably across tasks. Cons: Adds sequential computation to the forward pass (adapters can't be parallelized with the main path). This increases inference latency.
IA3 (Liu et al., 2022)
Idea: learn scaling vectors that rescale the keys, values, and FFN intermediate activations. No new weight matrices - just element-wise multiplication.
$$K' = l_k \odot K, \quad V' = l_v \odot V, \quad h' = l_{\text{ff}} \odot h$$

where $l_k, l_v \in \mathbb{R}^{d}$ and $l_{\text{ff}} \in \mathbb{R}^{d_{\text{ff}}}$.

Params per layer: $2d + d_{\text{ff}} = 2 \times 4096 + 16{,}384 = 24{,}576$. Total for 32 layers: $32 \times 24{,}576 \approx 0.8\text{M}$.
Pros: Extremely few parameters; zero inference overhead (scaling can be fused into weights). Cons: Less expressive than LoRA; struggles on tasks requiring significant adaptation.
Comparison
| Method | Trainable Params (7B model) | Inference Overhead | Training Stability | Performance |
|---|---|---|---|---|
| Full Fine-Tuning | 7B (100%) | None | Good | Best |
| LoRA (r=16) | ~17M (0.24%) | None (merge) | Good | Near-best |
| QLoRA (r=16) | ~17M (0.24%) | None (merge) | Good | Near-best |
| Prefix Tuning (p=20) | ~5M (0.07%) | Moderate | Unstable | Good |
| Adapters (r=64) | ~34M (0.49%) | Sequential | Good | Good |
| IA3 | ~0.8M (0.01%) | None | Good | Moderate |
LoRA's unique advantage: zero inference overhead. Because $\frac{\alpha}{r} BA$ can be merged into a single matrix $W' = W + \frac{\alpha}{r} BA$ at deployment time, there's no extra computation during inference. Adapters add sequential bottleneck layers. Prefix tuning adds extra KV entries. LoRA adds nothing.
This is why LoRA dominates in practice.
Merging LoRA Weights
The merge operation
One of LoRA's most elegant properties: at inference time, you can compute $W' = W + \frac{\alpha}{r} BA$ once and replace the original weight. The model is now identical to one that was fully fine-tuned (at the rank-$r$ approximation level), with zero runtime overhead.
With the PEFT library:
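A sketch of the merge with PEFT (model name and adapter path are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, attach a trained adapter, then fold the
# low-rank update into the base weights for zero-overhead inference.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()          # computes W + (alpha/r) * BA
merged.save_pretrained("path/to/merged-model")
```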
Multiple LoRA adapters: model switching
Because LoRA adapters are small (~50-100 MB for a 7B model), you can train many adapters for different tasks and swap them at runtime:
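A sketch of runtime adapter switching with PEFT (adapter paths and names are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# One base model in GPU memory, several small task adapters on disk.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "adapters/support-tickets",
                                  adapter_name="support")
model.load_adapter("adapters/sql-generation", adapter_name="sql")

model.set_adapter("support")   # route requests through the support adapter
# ... generate ...
model.set_adapter("sql")       # switch tasks without reloading the base
```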
This is how companies serve hundreds of specialized models from a single base model in production. The base model stays in GPU memory (expensive), while LoRA adapters are swapped in and out (cheap - just loading a few MB of weights).
LoRA adapter arithmetic
Because LoRA adapters are linear updates ($\Delta W = BA$), they support linear arithmetic:

$$W' = W + \lambda_1 B_1 A_1 + \lambda_2 B_2 A_2 + \cdots$$
This enables:
- Task interpolation: blend a code adapter and a math adapter to get a model good at both
- Task negation: subtract an adapter's effect ($\lambda < 0$) to remove a capability
- Progressive merging: gradually increase $\lambda$ from 0 to 1 to smoothly transition between behaviors
This linearity doesn't hold for full fine-tuning (nonlinear optimization landscape), making LoRA uniquely composable.
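A toy numpy illustration of this weight-space arithmetic (shapes are illustrative; real adapters also carry the $\alpha/r$ scale):

```python
import numpy as np

# Two hypothetical adapters ("code" and "math") on the same base weight.
rng = np.random.default_rng(0)
d, k, r = 32, 32, 4
W = rng.standard_normal((d, k))                                     # frozen base
B1, A1 = rng.standard_normal((d, r)), rng.standard_normal((r, k))   # adapter 1
B2, A2 = rng.standard_normal((d, r)), rng.standard_normal((r, k))   # adapter 2

W_blend = W + 0.5 * (B1 @ A1) + 0.5 * (B2 @ A2)   # interpolation
W_negate = W - 1.0 * (B1 @ A1)                    # remove adapter 1's effect

# The blend is exactly the average of the two single-adapter models:
W_a1 = W + B1 @ A1
W_a2 = W + B2 @ A2
print(np.allclose(W_blend, 0.5 * (W_a1 + W_a2)))  # True
```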
Advanced LoRA Variants
The success of LoRA has spawned numerous variants. Here are the most important ones:
DoRA (Weight-Decomposed Low-Rank Adaptation)
DoRA (Liu et al., 2024) decomposes the weight update into magnitude and direction components:

$$W' = m \cdot \frac{W + BA}{\lVert W + BA \rVert_c}$$

where $m$ is a learnable magnitude vector and $\lVert \cdot \rVert_c$ denotes column-wise normalization. This is inspired by weight normalization and consistently outperforms standard LoRA by 1-2% across benchmarks, with minimal additional cost (just one extra vector $m$).
AdaLoRA (Adaptive LoRA)
Standard LoRA uses the same rank $r$ for every weight matrix. AdaLoRA (Zhang et al., 2023) dynamically allocates rank across layers and matrices based on importance scores, parameterizing the update in SVD-like form:

$$\Delta W = P \Lambda Q$$

where $\Lambda$ is diagonal and its entries are pruned according to an importance metric.
Layers with higher importance (measured by gradient-based sensitivity) get more rank. Empirically, AdaLoRA concentrates rank in the lower and upper layers (which tend to be more task-specific), while middle layers get lower rank.
LoRA+ (Different Learning Rates for A and B)
Hayou et al. (2024) showed that using different learning rates for $A$ and $B$ improves convergence. Specifically, setting $\eta_B = \lambda \eta_A$ with $\lambda \gg 1$ consistently outperforms standard LoRA with a single learning rate. The intuition: $A$ handles the down-projection and benefits from a smaller learning rate for stability, while $B$ handles the up-projection and can learn faster.
Practical Guide: Training with LoRA
Let's put it all together with a complete training example. We'll fine-tune a language model for instruction following using QLoRA.
Setup
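A sketch of the setup step (model and dataset names are illustrative; substitute your own):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA has no pad token by default

# Load the base model quantized to 4-bit NF4 (the QLoRA recipe).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
dataset = load_dataset("tatsu-lab/alpaca", split="train")
```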
Configure LoRA
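A sketch of the LoRA configuration, continuing from the setup step (module names follow LLaMA conventions and vary by architecture):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)   # `model` from the setup step
lora_config = LoraConfig(
    r=16,                      # the safe-default rank
    lora_alpha=32,             # alpha = 2r convention
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # prints the trainable fraction
```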
Training loop
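A sketch of the training loop using the HuggingFace Trainer, continuing from the steps above (hyperparameters are reasonable starting points, not tuned values):

```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="lora-out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size 16
    learning_rate=2e-4,                # higher than full fine-tuning
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
    optim="paged_adamw_8bit",          # paged 8-bit AdamW (QLoRA recipe)
    gradient_checkpointing=True,       # trade compute for activation memory
)

def tokenize(example):
    # Assumes the dataset has a "text" column, as in the setup sketch.
    return tokenizer(example["text"], truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset.map(tokenize, remove_columns=dataset.column_names),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```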
Key hyperparameter notes
A few details that matter in practice:
Learning rate: LoRA benefits from a higher learning rate than full fine-tuning. Typical values are $1 \times 10^{-4}$ to $3 \times 10^{-4}$, compared to $1 \times 10^{-5}$ to $5 \times 10^{-5}$ for full fine-tuning. The reason: the LoRA update starts at zero (because $B$ is zero-initialized), so the adapter parameters need larger updates to move away from the initialization.
Dropout: LoRA-specific dropout (lora_dropout) is applied to the input of the LoRA branch (before the down-projection $A$), so it regularizes only the adapter path. Values of 0.05-0.1 work well. This is separate from any dropout in the base model.
Gradient checkpointing: Always enable this for QLoRA training. It recomputes activations during backprop instead of storing them, trading ~20% speed for ~60% memory savings on activations.
Optimizer: paged_adamw_8bit combines the QLoRA paged optimizer (OOM safety net) with 8-bit Adam (further memory savings on optimizer states). The 8-bit quantization of optimizer states has been shown to cause no degradation in practice.
Save and merge
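Putting the last step into code (a sketch; paths are placeholders, and `model`/`tokenizer` come from the steps above):

```python
# Save just the adapter - a few tens of MB containing A, B, and config.
model.save_pretrained("out/lora-adapter")

# Optionally fold (alpha/r) * BA into the base weights for deployment.
merged = model.merge_and_unload()
merged.save_pretrained("out/merged-model")       # standalone, no PEFT needed
tokenizer.save_pretrained("out/merged-model")
```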
Debugging LoRA Training
Common issues and how to fix them:
Loss doesn't decrease
- Check rank: Try increasing to $r = 32$ or $r = 64$
- Check target modules: Are you applying LoRA to the right layers? Use `model.print_trainable_parameters()` to verify
- Check learning rate: LoRA needs higher LR than full FT (try $2 \times 10^{-4}$)
- Check alpha: If $\alpha$ is too small, LoRA updates are suppressed
Loss decreases but eval quality is poor
- Overfitting: Lower rank, increase dropout, or add more data
- Data quality: Garbage in, garbage out. Check training examples
- Evaluation mismatch: Ensure eval format matches training format
Out of memory
- Enable gradient checkpointing: `model.gradient_checkpointing_enable()`
- Reduce batch size: Use gradient accumulation to maintain effective batch size
- Switch to QLoRA: 4-bit base model + fp16 LoRA
- Reduce sequence length: Memory scales linearly with sequence length
Performance gap with full fine-tuning
This is rare with proper hyperparameters, but if it happens:
- Increase rank: Try $r = 64$ or $r = 128$
- Apply to more layers: Include FFN weights, not just attention
- More epochs: LoRA sometimes needs more epochs to converge
- Try DoRA: The magnitude-direction decomposition often closes the gap
When to Use What
A decision framework:
Full fine-tuning when:
- You have abundant compute (multi-GPU cluster)
- Your task requires fundamentally different behavior from the base model
- Maximum performance matters more than efficiency
LoRA when:
- Single GPU or limited multi-GPU setup
- Your task is an adaptation of the base model's capabilities
- You need multiple task-specific models from one base
- You want zero inference overhead
QLoRA when:
- Single consumer GPU (24-48 GB)
- Fine-tuning models larger than your GPU can hold in fp16
- Development/experimentation (fast iteration)
IA3 / Prefix Tuning when:
- Extremely limited parameter budget
- Very simple tasks (classification, style transfer)
- Few-shot scenarios where even LoRA might overfit
The Bigger Picture
LoRA changed how we think about model adaptation. Before LoRA, fine-tuning was an all-or-nothing affair: you either updated every parameter or used prompting tricks. LoRA showed that the parameter space of fine-tuning is much smaller than the model itself - you can navigate it with a tiny steering wheel.
The implications extend beyond efficiency:
- Democratization: Anyone with a consumer GPU can fine-tune state-of-the-art models. The 65B parameter barrier became a 48GB barrier, and the 7B barrier became a 6GB barrier with QLoRA.
- Multi-tenant serving: One base model, many LoRA adapters. Companies like Predibase and Modal deploy hundreds of specialized models from a single base, swapping adapters at request time.
- Composition: LoRA adapters can be added, subtracted, and interpolated. This opens up model merging, task arithmetic, and continual learning without catastrophic forgetting.
- Scientific insight: The success of low-rank adaptation tells us something deep about the geometry of neural network loss landscapes. Fine-tuning doesn't explore the full parameter space - it follows low-dimensional trajectories. Understanding why this works is an active area of research.
The trajectory from full fine-tuning to LoRA to QLoRA mirrors a broader pattern in deep learning: the most impactful ideas are often the simplest. LoRA is just matrix factorization applied to weight updates. QLoRA is just quantization applied to the frozen weights. The genius is in recognizing where simplicity suffices.
Next in the series: we'll explore mixture of experts (MoE) - how sparse models like Mixtral use conditional computation to scale to hundreds of billions of parameters while only activating a fraction at inference time.
