Apr 3, 2026

Model Quantization - Squeezing Giants into Laptops

How to shrink a 14GB language model to 3.5GB with barely any quality loss. A deep dive into the math, methods, and magic of quantization: from number formats and PTQ to GPTQ, AWQ, and GGUF.


Large language models are extraordinary. They can write poetry, debug code, explain quantum mechanics. But they have a problem: they are enormous. A 7-billion parameter model stored in standard FP16 precision requires 14 GB of GPU memory just to hold the weights. No computation, no activations, no KV cache -- just the parameters sitting in VRAM. A 70B model? 140 GB. That doesn't fit on any single consumer GPU in existence.

This creates a memory wall -- a hard physical constraint that locks the most powerful models behind $10,000+ hardware. Quantization is the sledgehammer that breaks through it.

The core idea is beautifully simple: instead of storing each weight as a 16-bit or 32-bit floating-point number, store it as a 4-bit or 8-bit integer. Fewer bits per weight means less memory, faster memory bandwidth, and faster inference. The catch is that you lose some precision. The art of quantization is minimizing that loss.

The Memory Arithmetic

Let's make the problem concrete. Every parameter in a neural network is a number -- a weight or a bias. The memory footprint depends entirely on how many bits we use to represent each number.

For a model with $P$ parameters at $b$ bits per parameter:

$$\text{Memory (bytes)} = P \times \frac{b}{8}$$

For a 7B parameter model:

| Precision | Bits | Memory | Fits on... |
|---|---|---|---|
| FP32 | 32 | 28.0 GB | A100 40GB |
| FP16/BF16 | 16 | 14.0 GB | RTX 4090 (barely) |
| INT8 | 8 | 7.0 GB | RTX 3060 12GB |
| INT4 | 4 | 3.5 GB | Any modern GPU, or even a MacBook |

That last row is the magic. A model that required a $1,600 GPU can now run on a $200 one. The same model, the same architecture, the same capabilities (roughly) -- just stored more efficiently.
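The arithmetic is easy to sanity-check in a few lines (a minimal sketch; the helper name is mine):

```python
def model_memory_gb(params_billion: float, bits: int) -> float:
    """Weight memory in GB: each parameter costs bits/8 bytes."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A 7B model at the precisions from the table above
for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2} bits: {model_memory_gb(7, bits):.1f} GB")
```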

Play with the calculator below to see how different model sizes and precisions translate to memory requirements:

Four-bit quantization gives a 4x memory reduction over FP16 and an 8x reduction over FP32. But memory is only half the story. Modern GPUs are memory-bandwidth bound during inference -- the bottleneck isn't computation, it's moving weight data from VRAM to the compute cores. Smaller weights mean less data to move, which means faster token generation. A 4-bit model typically generates tokens 2-3x faster than the same model in FP16.

Number Representations: The Bit-Level Foundation

To understand what we lose when we quantize, we first need to understand how computers represent numbers. This is the foundation everything else builds on.

Floating Point: The Scientific Notation of Hardware

Floating-point numbers use a representation inspired by scientific notation. The number $-6.75$ can be written as $-1.1011 \times 2^2$ in binary. This decomposes into three fields:

  1. Sign bit (1 bit): Is the number positive or negative?
  2. Exponent ($e$ bits): What power of 2 to multiply by? (The "scale")
  3. Mantissa/Significand ($m$ bits): The fractional digits. (The "precision")

The value is reconstructed as:

$$(-1)^{\text{sign}} \times 2^{\text{exponent} - \text{bias}} \times (1 + \text{mantissa})$$

The bias centers the exponent range around zero so we can represent both very large and very small numbers.
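To make the decomposition concrete, here is a small sketch (the helper `fp32_fields` is my own name) that extracts the three fields from a Python float using the standard `struct` module:

```python
import struct

def fp32_fields(x: float):
    """Decompose a float into IEEE 754 single-precision bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 fraction bits
    return sign, exponent, mantissa

sign, exp, man = fp32_fields(-6.75)
print(sign, exp - 127, bin(man))  # sign=1, unbiased exponent=2, mantissa bits 1011...
```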

FP32: The Gold Standard

FP32 (IEEE 754 single precision) uses 32 bits:

  • 1 bit sign
  • 8 bits exponent (bias = 127), giving range $2^{-126}$ to $2^{127} \approx 1.7 \times 10^{38}$
  • 23 bits mantissa, giving ~7 decimal digits of precision

This is what CPUs use natively. It's the default for most scientific computing and was the standard for deep learning training until ~2018.

FP16: Half Precision

FP16 uses 16 bits:

  • 1 bit sign
  • 5 bits exponent (bias = 15), giving range $2^{-14}$ to $2^{15} = 32{,}768$
  • 10 bits mantissa, giving ~3.3 decimal digits of precision

Half the bits, half the memory. But the reduced exponent range is a problem: the maximum representable value is only 65,504. In deep learning, loss values and gradient accumulations can easily exceed this, causing overflow (values becoming infinity). This is why mixed-precision training uses FP16 for forward/backward passes but keeps an FP32 master copy of the weights.
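You can see the overflow directly with NumPy (a tiny demonstration; NumPy's `float16` implements the IEEE half-precision format described above):

```python
import numpy as np

# The largest finite FP16 value is 65,504; anything bigger overflows
print(np.float16(65504.0))   # still finite
print(np.float16(70000.0))   # overflows to inf
```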

BF16: Brain Float -- Google's Trick

BF16 (Brain Floating Point) uses 16 bits differently:

  • 1 bit sign
  • 8 bits exponent (bias = 127), same range as FP32!
  • 7 bits mantissa, giving ~2.4 decimal digits of precision

The genius insight: for deep learning, dynamic range matters more than precision. Neural network training produces values spanning many orders of magnitude (gradients can be $10^{-8}$ in one layer and $10^{3}$ in another). BF16 keeps the full FP32 range while sacrificing some precision. In practice, the rounding errors from 7 mantissa bits rarely affect model quality.

BF16 is now the default for training large models on TPUs and modern NVIDIA GPUs (A100, H100).

INT8 and INT4: The Quantization Targets

Integer formats are fundamentally different from floating point. There's no exponent -- every number is uniformly spaced:

  • INT8: 256 discrete values, typically $[-128, 127]$ (signed) or $[0, 255]$ (unsigned)
  • INT4: 16 discrete values, typically $[-8, 7]$ (signed) or $[0, 15]$ (unsigned)

The uniform spacing is both the strength and weakness. Floating-point numbers have higher density near zero (where most neural network weights live) and lower density for large values. Integers space everything evenly, which wastes precision on the tails of the distribution.

Use the explorer below to type any number and see exactly how it gets stored in each format, bit by bit:

Why This Matters for Quantization

When we convert a model from FP16 to INT8, we're going from 65,536 possible values (per number) to 256 possible values. That's a 256x reduction in expressiveness. For INT4, it's 16 possible values -- a 4,096x reduction.

The entire field of quantization is about choosing which 16 (or 256) values to use, and how to map each weight to its closest representative.

Post-Training Quantization (PTQ)

The simplest approach: take a fully trained FP16 model and convert it to lower precision after training is complete. No retraining, no fine-tuning. Just map the weights.

Symmetric Quantization

In symmetric quantization, we center the quantization grid at zero. The mapping is:

$$x_q = \text{round}\left(\frac{x}{s}\right)$$

where the scale factor $s$ is determined by the maximum absolute value of the tensor:

$$s = \frac{\max(|x|)}{2^{b-1} - 1}$$

To dequantize (reconstruct the approximate original value):

$$\hat{x} = x_q \times s$$

For example, quantizing to INT8 ($b = 8$):

  • If the maximum absolute weight is 1.5, then $s = 1.5 / 127 = 0.01181$
  • A weight of $0.75$ maps to $\text{round}(0.75 / 0.01181) = \text{round}(63.5) = 64$
  • Dequantized: $64 \times 0.01181 = 0.7559$ (error $= 0.0059$)

The simplicity is appealing, but the symmetric constraint wastes quantization levels when the distribution isn't centered at zero.
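The worked example above can be reproduced in a few lines of NumPy (a minimal sketch; `symmetric_quantize` is my own helper name):

```python
import numpy as np

def symmetric_quantize(x: np.ndarray, bits: int = 8):
    """Symmetric quantization: scale from max |x|, grid centered at zero."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8
    scale = np.abs(x).max() / qmax
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

w = np.array([1.5, 0.75, -0.3])
w_q, s = symmetric_quantize(w)
print(w_q, s, w_q * s)  # 0.75 round-trips to ~0.7559, as in the example
```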

Asymmetric Quantization

Asymmetric quantization adds a zero-point offset to handle distributions that aren't centered at zero:

$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$

where:

$$s = \frac{\max(x) - \min(x)}{2^b - 1}, \quad z = \text{round}\left(\frac{-\min(x)}{s}\right)$$

Dequantization:

$$\hat{x} = (x_q - z) \times s$$

The zero-point $z$ shifts the quantization grid so that the full $[0, 2^b - 1]$ integer range maps exactly to the $[\min(x), \max(x)]$ range of the original values. This is more expressive -- we don't waste levels on a part of the range the weights don't use.
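A minimal NumPy sketch of the asymmetric scheme (helper names are mine) shows how a range that doesn't straddle zero still uses the full integer grid:

```python
import numpy as np

def asymmetric_quantize(x: np.ndarray, bits: int = 8):
    """Asymmetric quantization: map [min(x), max(x)] onto [0, 2^b - 1]."""
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax
    # zero_point is where real zero lands on the integer grid;
    # it can be negative when the range doesn't include zero
    zero_point = int(np.round(-x.min() / scale))
    x_q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return x_q, scale, zero_point

def asymmetric_dequantize(x_q, scale, zero_point):
    return (x_q.astype(np.float32) - zero_point) * scale

x = np.array([0.1, 0.5, 2.0], dtype=np.float32)  # not centered at zero
x_q, s, z = asymmetric_quantize(x)
print(x_q, s, z, asymmetric_dequantize(x_q, s, z))
```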

Drag the bit slider in the visualization below to see how quantization bins overlay onto a real weight distribution, and toggle between symmetric and asymmetric modes:

Per-Tensor vs. Per-Channel vs. Per-Group

The scale factor $s$ doesn't have to be computed over the entire tensor. Finer granularity means better approximation:

Per-tensor: One scale factor for the entire weight matrix. Cheapest to store (1 extra number per matrix), but the most lossy. A single outlier weight forces all other weights into a coarser grid.

Per-channel (per-row or per-column): One scale factor per output channel. For a weight matrix $W \in \mathbb{R}^{m \times n}$, we store $m$ scale factors instead of 1. Much better accuracy because each row can adapt to its own range.

Per-group: Split each row into groups of $g$ elements (typically $g = 128$) and compute a separate scale for each group. This is the standard in modern 4-bit quantization (GPTQ, AWQ) because it balances storage overhead with quality.

```python
import torch

def per_group_quantize(weights: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Quantize a weight matrix using per-group symmetric quantization."""
    assert weights.dim() == 2
    rows, cols = weights.shape
    assert cols % group_size == 0, f"cols ({cols}) must be divisible by group_size ({group_size})"

    # Reshape into groups
    w = weights.reshape(rows, -1, group_size)  # (rows, num_groups, group_size)

    # Compute per-group scale factors
    qmax = (1 << (bits - 1)) - 1  # e.g., 7 for 4-bit
    abs_max = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10)  # (rows, num_groups, 1)
    scales = abs_max / qmax  # (rows, num_groups, 1)

    # Quantize: round to nearest integer
    w_q = (w / scales).round().clamp(-qmax, qmax).to(torch.int8)

    # Dequantize
    w_deq = (w_q.float() * scales).reshape(rows, cols)
    return w_q, scales.squeeze(-1), w_deq

# Example usage
W = torch.randn(4096, 4096)  # A typical weight matrix
W_q, scales, W_approx = per_group_quantize(W, bits=4, group_size=128)

mse = ((W - W_approx) ** 2).mean()
print(f"MSE: {mse:.6f}")
print(f"Scales shape: {scales.shape}")  # (4096, 32) — 32 groups per row
print(f"Memory: weights = {W_q.numel() * 4 / 8 / 1e6:.1f} MB, "
      f"scales = {scales.numel() * 2 / 1e6:.2f} MB")
```

The per-group overhead is small: one FP16 scale per 128 four-bit weights is 16 bits per 512 bits, about 3% extra memory.

Calibration: Finding the Right Scale

A subtle but critical detail: how do you determine the range $[\min(x), \max(x)]$ to quantize into?

Min-max calibration: Use the actual min and max of the weight tensor. Simple, but outliers can stretch the range and waste precision on the bulk of the distribution.

Percentile calibration: Clip the range to the 99.99th percentile. Sacrifices accuracy on the outliers (which are rare) to improve accuracy on the rest.

MSE-optimal calibration: Search for the clipping threshold that minimizes the mean squared error between original and quantized weights. This is what most modern frameworks use.

Entropy calibration (KL divergence): Minimize the KL divergence between the original weight distribution and the quantized distribution. Used in TensorRT.

```python
import torch

def mse_optimal_clip(weights: torch.Tensor, bits: int = 8, num_candidates: int = 200):
    """Find the clipping range that minimizes quantization MSE."""
    abs_max = weights.abs().max().item()
    best_mse = float("inf")
    best_clip = abs_max

    for i in range(1, num_candidates + 1):
        clip_val = abs_max * i / num_candidates

        # Clip and quantize
        clipped = weights.clamp(-clip_val, clip_val)
        qmax = (1 << (bits - 1)) - 1
        scale = clip_val / qmax
        quantized = (clipped / scale).round().clamp(-qmax, qmax) * scale

        mse = ((weights - quantized) ** 2).mean().item()
        if mse < best_mse:
            best_mse = mse
            best_clip = clip_val

    return best_clip, best_mse
```

Quantization-Aware Training (QAT)

PTQ works well for INT8 but struggles with INT4 -- the precision loss is too large. Quantization-Aware Training addresses this by inserting quantization into the training loop itself, allowing the model to learn weights that are robust to quantization noise.

The Core Problem: Rounding is Non-Differentiable

The quantize function contains round(), which has zero gradient almost everywhere (the derivative of a step function is zero between steps and undefined at steps). If we naively insert quantization into the forward pass, gradients won't flow through it, and training stops working.

The Straight-Through Estimator (STE)

The solution is a beautiful hack from Bengio et al. (2013):

  • Forward pass: Apply real quantization. The network sees quantized weights and learns to work with them.
  • Backward pass: Pretend the quantization didn't happen. Pass the gradient straight through as if it were an identity function.

Formally, let $Q(x)$ be the quantize operation. The STE approximates:

$$\frac{\partial Q(x)}{\partial x} \approx 1$$

This is mathematically unjustified -- the true gradient is zero -- but it works spectacularly well in practice. The intuition is that even though the gradient doesn't perfectly describe the quantized landscape, it points in roughly the right direction, and SGD is robust to noisy gradients.

Fake Quantization Nodes

In practice, QAT is implemented by inserting "fake quantization" nodes into the computation graph. These nodes quantize-then-dequantize in the forward pass (simulating the precision loss) but use the STE in the backward pass:

```python
import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    """Fake quantization with straight-through estimator."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin, qmax):
        # Quantize then immediately dequantize (simulate quantization loss)
        x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        x_deq = (x_q - zero_point) * scale
        return x_deq

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass gradient unchanged
        return grad_output, None, None, None, None

class QATLinear(nn.Module):
    """Linear layer with quantization-aware training."""

    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.bits = bits
        self.qmin = -(1 << (bits - 1))
        self.qmax = (1 << (bits - 1)) - 1
        # Learnable scale (per-channel)
        self.scale = nn.Parameter(torch.ones(out_features))
        self.zero_point = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Fake-quantize the weights
        w = self.linear.weight
        scale = self.scale.abs().clamp(min=1e-8).unsqueeze(1)
        zp = self.zero_point.round().unsqueeze(1)
        w_q = FakeQuantize.apply(w, scale, zp, self.qmin, self.qmax)
        return nn.functional.linear(x, w_q, self.linear.bias)
```

The training loop then looks normal -- the quantization noise is baked into every forward pass, and the model adapts:

```python
model = SomeModel()  # Replace linear layers with QATLinear
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for batch in dataloader:
    loss = model(batch)   # Forward pass uses fake-quantized weights
    loss.backward()       # Backward pass uses STE
    optimizer.step()      # Update the full-precision master weights
    optimizer.zero_grad()

# After training, export the truly quantized model
export_quantized(model)
```

QAT vs PTQ: When to Use Which

PTQ is preferred when:

  • You have a pre-trained model and can't afford retraining
  • You're targeting INT8 (where PTQ usually works fine)
  • Latency of deployment matters more than maximum accuracy

QAT is preferred when:

  • You need INT4 or lower precision
  • You're training from scratch anyway
  • Maximum accuracy at a given precision is critical

In practice, most deployed LLMs use PTQ with advanced methods (GPTQ, AWQ) that approach QAT quality without requiring retraining.

GPTQ: One-Shot Weight Quantization via Hessian Information

GPTQ (Frantar et al., 2022) is the method that made 4-bit LLM quantization practical. It achieves near-lossless INT4 quantization without any retraining, processing one layer at a time using second-order (Hessian) information to minimize quantization error.

The Optimal Brain Quantization Framework

GPTQ builds on a beautiful theoretical framework. The key insight: when you quantize a weight, you introduce an error. But you can compensate for that error by slightly adjusting the remaining unquantized weights.

Consider a single linear layer with weight matrix $W$ and a calibration dataset that produces input features $X$. The layer's output is $WX$. After quantizing $W$ to $\hat{W}$, we want to minimize:

$$\mathcal{L} = \| WX - \hat{W}X \|_2^2$$

This is the output reconstruction error -- we don't care about individual weight errors, only about how well the layer's output is preserved.

For a single weight $w_{ij}$, the optimal quantization (accounting for the effect on all other weights via the Hessian $H = 2XX^T$) gives the update rule:

$$\delta_j = -\frac{w_j - \text{quant}(w_j)}{[H^{-1}]_{jj}} \cdot (H^{-1})_{:,j}$$

where $\delta_j$ is the correction applied to all remaining unquantized weights in the row after quantizing column $j$. This is the optimal error compensation -- it redistributes the quantization error across the remaining weights to minimize the overall layer output error.

The Algorithm

GPTQ processes each row of the weight matrix independently, quantizing one column at a time:

```python
import torch

def gptq_quantize_row(w_row, H_inv, bits=4, group_size=128):
    """
    Quantize one row of a weight matrix using GPTQ.

    Args:
        w_row: (n,) - one row of the weight matrix
        H_inv: (n, n) - inverse of the Hessian (2 * X @ X^T)
        bits: target bit width
        group_size: per-group quantization granularity
    """
    n = w_row.shape[0]
    w = w_row.clone()
    w_q = torch.zeros_like(w)
    qmax = (1 << (bits - 1)) - 1

    for j in range(n):
        # Determine scale for current group
        group_idx = j // group_size
        group_start = group_idx * group_size
        group_end = min(group_start + group_size, n)
        group_max = w[group_start:group_end].abs().max().clamp(min=1e-10)
        scale = group_max / qmax

        # Quantize weight j
        q = torch.round(w[j] / scale).clamp(-qmax, qmax)
        w_q[j] = q * scale

        # Quantization error for weight j
        err = (w[j] - w_q[j]).item()

        # Compensate remaining weights using Hessian information
        if j + 1 < n:
            w[j+1:] -= err / H_inv[j, j] * H_inv[j, j+1:]

    return w_q
```

The full GPTQ algorithm:

  1. Collect calibration data: Run ~128 samples from the training set through the model to compute activation statistics.
  2. Compute the Hessian: For each layer, $H = 2XX^T$ where $X$ is the input activation matrix.
  3. Compute the inverse Hessian: $H^{-1}$ using Cholesky decomposition (numerically stable).
  4. Quantize column-by-column: For each row, quantize weights one at a time, applying the optimal error compensation after each quantization.

Why GPTQ Works So Well

The key insight is that not all weights are equally important. A weight connected to a frequently-activated feature has a larger effect on the output than a weight connected to a rarely-used feature. The Hessian $H = 2XX^T$ captures this: features with high activation variance get large diagonal entries in $H$, which means the algorithm is more conservative when quantizing their associated weights.

GPTQ is also fast: the original paper quantizes a 175B model in about 4 GPU-hours, and a 7B model takes well under an hour on a single GPU. The resulting INT4 model typically has less than 0.5 perplexity increase over the FP16 baseline -- a remarkable achievement for a method that never retrains.

AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) takes a different approach to the same problem. Instead of compensating for errors after quantization (like GPTQ), AWQ identifies the most important weights before quantization and protects them.

The Observation: Not All Weights Are Equal

Consider a weight matrix $W$ and input activations $X$. Some weights consistently multiply large activations -- these are salient weights. Quantizing them aggressively causes disproportionate output error. Other weights multiply near-zero activations and can be quantized aggressively with minimal impact.

AWQ identifies salient weight channels by looking at the activation magnitudes:

$$\text{saliency}_j = \frac{1}{N} \sum_{i=1}^{N} |X_{ij}|$$

Channels with high average activation magnitude are salient. Typically, only 1-3% of channels are salient, but they're responsible for a disproportionate fraction of the output.

The Per-Channel Scaling Trick

The naive approach would be to keep salient weights in higher precision (mixed-precision), but that complicates hardware implementation. AWQ's insight is more elegant: scale the salient weight channels before quantization.

For a salient channel $j$, multiply the weight column $W_{:,j}$ by a scale factor $s_j > 1$ before quantization, and divide the corresponding activations by $s_j$ to preserve the output:

$$W_{:,j} X_{j,:} = (s_j \cdot W_{:,j}) \cdot \left(\frac{1}{s_j} \cdot X_{j,:}\right)$$

Mathematically, the output is identical. But the quantization error changes! Scaling up a weight before quantization makes the quantization grid finer relative to the weight's original magnitude, effectively giving salient weights higher precision.

The optimal scale factor for channel $j$ can be found by grid search:

$$s_j^* = \arg\min_{s_j} \left\| Q(s_j \cdot W_{:,j}) \cdot \frac{1}{s_j} \cdot X_{j,:} - W_{:,j} \cdot X_{j,:} \right\|_2^2$$

In practice, AWQ uses a power function $s_j = (\text{saliency}_j)^\alpha$ where $\alpha$ is a hyperparameter (typically 0.5).

```python
import torch

def awq_scale_search(W, X, bits=4, group_size=128, grid_size=20):
    """
    Search for optimal per-channel scales using AWQ.

    Args:
        W: (out_features, in_features) weight matrix
        X: (num_samples, in_features) calibration activations
        bits, group_size: quantization parameters
    """
    out_f, in_f = W.shape

    # Compute per-channel activation saliency
    saliency = X.abs().mean(dim=0)  # (in_features,)

    # Try different alpha values
    best_scales = torch.ones(in_f)
    best_error = float("inf")

    for alpha_step in range(0, grid_size + 1):
        alpha = alpha_step / grid_size  # 0 to 1
        scales = saliency.pow(alpha).clamp(min=1e-4)
        # Normalize so average scale is 1 (don't change overall magnitude)
        scales = scales / scales.mean()

        # Scale weights and quantize
        W_scaled = W * scales.unsqueeze(0)  # Scale columns
        W_q = per_group_quantize_dequantize(W_scaled, bits, group_size)
        W_deq = W_q / scales.unsqueeze(0)  # Unscale after quantization

        # Measure output error
        out_orig = X @ W.T
        out_quant = X @ W_deq.T
        error = ((out_orig - out_quant) ** 2).mean().item()

        if error < best_error:
            best_error = error
            best_scales = scales.clone()

    return best_scales, best_error

def per_group_quantize_dequantize(W, bits, group_size):
    """Simple per-group symmetric quantize + dequantize."""
    out_f, in_f = W.shape
    qmax = (1 << (bits - 1)) - 1
    W = W.reshape(out_f, -1, group_size)
    scales = W.abs().amax(dim=-1, keepdim=True) / qmax
    W_q = (W / scales.clamp(min=1e-10)).round().clamp(-qmax, qmax)
    W_deq = (W_q * scales).reshape(out_f, in_f)
    return W_deq
```

AWQ vs GPTQ

| | GPTQ | AWQ |
|---|---|---|
| Approach | Compensate errors after quantization | Protect important weights before quantization |
| Uses | Hessian (second-order) information | Activation magnitude (first-order) |
| Speed | Slower (column-by-column processing) | Faster (single forward pass per layer) |
| Quality | Slightly better at 3-bit | Slightly better at 4-bit |
| Hardware | General | Optimized for GPU inference |

In practice, both methods achieve comparable quality. The chart below compares them head-to-head against the naive round-to-nearest baseline:

The takeaway: at 4 bits, both GPTQ and AWQ achieve perplexity within 2-3% of FP16. Below 4 bits, the gap widens, and method choice matters more. Above 4 bits, even naive rounding works well enough.

GGUF and llama.cpp: Quantization for Everyone

Theory is beautiful, but what do practitioners actually use to run quantized models? The answer, for most people, is llama.cpp and its GGUF format.

The llama.cpp Revolution

In March 2023, Georgi Gerganov released llama.cpp -- a C/C++ implementation of LLaMA inference that runs on CPUs. No GPU required. No Python. No PyTorch. Just pure C code that loads quantized model weights and generates text.

This was revolutionary because it democratized LLM inference. Suddenly, anyone with a laptop could run a 7B model. The key enabler was quantization.

GGUF Quantization Types

GGUF (GPT-Generated Unified Format) supports multiple quantization levels. The naming convention encodes the method:

  • Q4_0: 4-bit quantization, method 0 (simplest, per-block symmetric)
  • Q4_K_M: 4-bit quantization, K-quants, Medium quality
  • Q5_K_M: 5-bit quantization, K-quants, Medium quality
  • Q8_0: 8-bit quantization, method 0

The "K-quants" (introduced in June 2023) use a sophisticated mixed-precision scheme: different layers of the model are quantized to different bit widths based on their sensitivity. Attention layers (which are more sensitive) get more bits; feedforward layers (which are more robust) get fewer bits.

The Quality-Size Tradeoff

Here are approximate benchmarks for LLaMA-2 7B quantized with llama.cpp:

| Quant Type | Bits (avg) | Size (GB) | Perplexity | Perplexity Increase |
|---|---|---|---|---|
| FP16 | 16.0 | 13.5 | 5.67 | -- (baseline) |
| Q8_0 | 8.0 | 7.2 | 5.67 | +0.00 |
| Q5_K_M | 5.5 | 5.1 | 5.69 | +0.02 |
| Q4_K_M | 4.8 | 4.4 | 5.73 | +0.06 |
| Q4_0 | 4.5 | 4.0 | 5.98 | +0.31 |
| Q3_K_M | 3.9 | 3.5 | 6.15 | +0.48 |
| Q2_K | 3.4 | 3.0 | 6.87 | +1.20 |

The sweet spot: Q4_K_M offers 3x compression over FP16 with negligible quality loss. Q5_K_M is essentially lossless.

Running a Quantized Model

Using llama.cpp is remarkably simple:

```bash
# Download a GGUF model (e.g., from Hugging Face)
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Run inference
./llama-cli -m llama-2-7b.Q4_K_M.gguf \
    -p "The future of AI is" \
    -n 200 \
    --temp 0.7
```

This runs on CPU with no GPU required. On an M2 MacBook Pro, you get ~30 tokens/second for a 7B Q4 model -- fast enough for interactive use.

Quantizing Your Own Models

```bash
# Convert a Hugging Face model to GGUF
python convert_hf_to_gguf.py ./my-model/ --outfile my-model-f16.gguf

# Quantize to Q4_K_M
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```

Practical Quantization with bitsandbytes

For GPU inference via the Hugging Face ecosystem, bitsandbytes provides drop-in quantization:

8-bit Quantization (LLM.int8())

LLM.int8() (Dettmers et al., 2022) introduced a clever trick: outlier-aware mixed-precision decomposition. It discovered that transformer activations contain a small number of "outlier" features with very large magnitudes. These outliers break standard INT8 quantization.

The solution: detect outlier features (typically those with magnitude > 6), extract them, compute those in FP16, and compute everything else in INT8. The results are merged at the end.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model in 8-bit precision
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=True,   # Enable 8-bit quantization
    device_map="auto",   # Automatically place on GPU
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Use normally -- quantization is transparent
inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

4-bit Quantization (QLoRA)

bitsandbytes also supports 4-bit inference using the NormalFloat4 (NF4) data type, which is optimized for the normal distribution of neural network weights:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,        # Quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The bnb_4bit_use_double_quant=True flag is a neat trick: it quantizes the FP16 scale factors themselves to INT8, saving an additional ~0.4 bits per parameter. This is called double quantization and reduces the memory overhead of per-group scales from ~0.5 GB to ~0.125 GB for a 7B model.
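The savings can be estimated with simple arithmetic. This sketch assumes a block size of 64 weights and an effective 8 bits per double-quantized scale; the exact second-level overhead in bitsandbytes differs slightly, so treat the numbers as rough estimates:

```python
def scale_overhead_gb(params: float, block_size: int, bits_per_scale: float) -> float:
    """Memory spent on quantization scale factors alone, in GB."""
    num_blocks = params / block_size
    return num_blocks * bits_per_scale / 8 / 1e9

P = 7e9
print(scale_overhead_gb(P, 64, 32))  # FP32 scale per 64-weight block: ~0.44 GB
print(scale_overhead_gb(P, 64, 8))   # double-quantized to ~8 bits: ~0.11 GB
```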

Combining Quantization with LoRA for Fine-Tuning

The QLoRA technique (Dettmers et al., 2023) combines 4-bit quantization with Low-Rank Adaptation to enable fine-tuning a 65B model on a single 48GB GPU:

```python
from peft import LoraConfig, get_peft_model

# Load base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters (these are trained in FP16/BF16)
lora_config = LoraConfig(
    r=16,            # LoRA rank
    lora_alpha=32,   # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 3,540,389,888 || trainable%: 0.118%
```

The frozen 4-bit base model uses ~3.5 GB. The trainable LoRA parameters add ~16 MB. Optimizer states (Adam) for the LoRA parameters add ~32 MB. Total: under 4 GB for fine-tuning a 7B model. This is what makes it possible to fine-tune LLMs on consumer GPUs.

Advanced Topics

Activation Quantization

Everything we've discussed so far quantizes weights, which are static. But activations (the intermediate values that flow through the network during inference) also consume memory and bandwidth. Quantizing activations is harder because:

  1. Activations change with every input: Weights are fixed after training, so we can carefully calibrate quantization parameters. Activations are different for every token.
  2. Outlier features: Transformers produce activation outliers (values 100x larger than the mean) in certain channels. These are the same outliers that LLM.int8() handles.
  3. Dynamic range varies by layer: Early layers might have activations in $[-1, 1]$; later layers might have $[-100, 100]$.

Dynamic quantization handles this by computing the scale factor on-the-fly for each inference step. The overhead is small (one max reduction per tensor) but the quality is much better than using static scales.
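A minimal NumPy sketch of dynamic per-tensor INT8 activation quantization (the helper name is mine) -- the scale is recomputed from each tensor's actual range:

```python
import numpy as np

def dynamic_quantize_int8(activations: np.ndarray):
    """Dynamic quantization: compute the scale from this tensor's actual range."""
    scale = np.abs(activations).max() / 127 + 1e-12  # avoid divide-by-zero
    a_q = np.clip(np.round(activations / scale), -127, 127).astype(np.int8)
    return a_q, scale

# Each inference step gets its own scale -- no calibration needed
rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 512)).astype(np.float32)          # small range
x2 = (100 * rng.standard_normal((4, 512))).astype(np.float32)  # large range
for x in (x1, x2):
    x_q, s = dynamic_quantize_int8(x)
    err = np.abs(x - x_q.astype(np.float32) * s).max()
    print(f"scale={s:.4f}, max reconstruction error={err:.4f}")
```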

SmoothQuant: Making Activations Quantization-Friendly

SmoothQuant (Xiao et al., 2022) observes that weight distributions are easy to quantize (smooth, bell-shaped) while activation distributions are hard (spiky outliers). The solution: migrate the quantization difficulty from activations to weights using a mathematically equivalent transformation.

For a linear layer $Y = XW$, multiply activations by a diagonal matrix $\text{diag}(s)^{-1}$ and weights by $\text{diag}(s)$:

$$Y = X W = (X\, \text{diag}(s)^{-1}) \cdot (\text{diag}(s)\, W)$$

Choose $s$ to balance the difficulty:

$$s_j = \frac{\max(|X_j|)^\alpha}{\max(|W_j|)^{1-\alpha}}$$

where $\alpha \in [0, 1]$ controls the balance (typically $\alpha = 0.5$). This smooths the activation outliers by absorbing them into the weights, making both sides easier to quantize.
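A NumPy sketch of the smoothing transformation (my own illustrative implementation, not the official SmoothQuant code):

```python
import numpy as np

def smoothquant_transform(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Migrate quantization difficulty from activations to weights.

    X: (tokens, in_features) activations; W: (in_features, out_features) weights.
    """
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))
    s = np.clip(s, 1e-5, None)
    X_smooth = X / s            # divide each activation channel j by s_j
    W_smooth = W * s[:, None]   # fold s_j into weight row j
    return X_smooth, W_smooth, s

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
X[:, 3] *= 100                       # one outlier channel, as in real transformers
W = rng.standard_normal((8, 16))
X_s, W_s, s = smoothquant_transform(X, W)
# The layer output is mathematically unchanged
print(np.allclose(X @ W, X_s @ W_s))  # True
```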

The Frontier: Sub-4-Bit Quantization

The newest research explores even more aggressive approaches:

1-bit quantization (BitNet): Microsoft's BitNet b1.58 uses ternary weights $\{-1, 0, 1\}$, replacing all multiplications with additions. A 3B parameter model at 1.58 bits per weight uses under 1 GB of memory. The key insight is that models need to be trained from scratch with ternary constraints -- you cannot post-training quantize to 1 bit. The training uses the sign function in the forward pass with STE in the backward pass:

$$w_{\text{ternary}} = \text{RoundClip}\left(\frac{w}{\gamma + \epsilon}, -1, 1\right)$$

where $\gamma = \frac{1}{nm}\sum |W_{ij}|$ is the mean absolute value of the weight matrix. Early results show BitNet b1.58 matching FP16 performance at the same parameter count while being dramatically more efficient.
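A sketch of the RoundClip quantizer above (illustrative only -- in real BitNet training this runs inside every forward pass with STE gradients):

```python
import numpy as np

def ternary_quantize(W: np.ndarray, eps: float = 1e-8):
    """BitNet b1.58-style RoundClip: weights collapse to {-1, 0, +1}."""
    gamma = np.abs(W).mean()                        # mean |W_ij| of the matrix
    W_t = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_t, gamma

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
W_t, gamma = ternary_quantize(W)
print(np.unique(W_t))  # [-1.  0.  1.]
```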

Mixed-precision quantization: Different layers and different weight matrices get different bit widths. Attention projections are more sensitive than FFN weights, so they get more bits. This is what GGUF K-quants implement. The allocation can be formulated as a constrained optimization:

\min_{b_1, \ldots, b_L} \sum_{l=1}^{L} \mathcal{E}_l(b_l) \quad \text{subject to} \quad \sum_{l=1}^{L} n_l \cdot b_l \leq B_{\text{total}}

where \mathcal{E}_l(b_l) is the quantization error of layer l at b_l bits, n_l is the number of parameters in layer l, and B_{\text{total}} is the total bit budget. In practice, this is solved greedily: start with a uniform bit allocation and iteratively increase bits for the most sensitive layers.
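
The greedy procedure can be sketched as a toy allocator. The exponential error model and layer sizes below are hypothetical stand-ins -- real implementations measure \mathcal{E}_l empirically:

```python
def greedy_bit_allocation(n_params, error_fn, budget_bits, b_min=2, b_max=8):
    """Start every layer at b_min bits; repeatedly give one more bit to the
    layer with the best error reduction per bit of budget spent."""
    L = len(n_params)
    bits = [b_min] * L
    used = sum(n * b for n, b in zip(n_params, bits))
    while True:
        best, best_gain = None, 0.0
        for l in range(L):
            if bits[l] >= b_max or used + n_params[l] > budget_bits:
                continue
            gain = (error_fn(l, bits[l]) - error_fn(l, bits[l] + 1)) / n_params[l]
            if gain > best_gain:
                best, best_gain = l, gain
        if best is None:
            break
        bits[best] += 1
        used += n_params[best]
    return bits

# Hypothetical error model: error halves per extra bit, and layer 0 (say, an
# attention projection) is 4x more sensitive than the two FFN layers.
n_params = [1000, 4000, 4000]
sensitivity = [4.0, 1.0, 1.0]
error = lambda l, b: sensitivity[l] * 2.0 ** (-b)
bits = greedy_bit_allocation(n_params, error, budget_bits=4 * sum(n_params))
print(bits)
```

With a 4-bit average budget, the sensitive attention layer ends up with more bits than the FFN layers -- exactly the asymmetry GGUF K-quants bake in.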

Quantization + Pruning: Combine quantization (fewer bits) with pruning (fewer weights) for even greater compression. A 4-bit pruned model can be 10x smaller than the FP16 dense model. The SparseGPT algorithm prunes 50-60% of weights while simultaneously quantizing the remaining ones to INT4, achieving compression ratios of 8-16x.

QuIP (Quantization with Incoherence Processing): This method randomly rotates the weight matrix before quantization to spread out outlier structure, then applies lattice-based quantization in the rotated space. It achieves state-of-the-art results at 2 bits per weight.
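
The rotation idea is easy to see numerically: a random orthogonal rotation spreads a structured outlier across all coordinates, shrinking the max-to-RMS ratio that low-bit quantizers struggle with. The snippet below is a toy illustration of that one property, not the QuIP algorithm itself:

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64) * 0.01
W[:, 7] += 1.0                      # one outlier column of the kind that hurts low-bit quantization

# Random orthogonal matrix via QR decomposition of a Gaussian matrix
Q, _ = torch.linalg.qr(torch.randn(64, 64))
W_rot = W @ Q                       # rotation: invertible, norm-preserving

def incoherence(M):
    """Max entry relative to RMS entry: lower means easier to quantize."""
    return (M.abs().max() / M.pow(2).mean().sqrt()).item()

print(f"before rotation: {incoherence(W):.1f}")
print(f"after rotation:  {incoherence(W_rot):.1f}")
```

Since the rotation is orthogonal, it can be undone exactly (or fused into adjacent layers) at inference time; only the intermediate representation being quantized changes.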

Choosing a Quantization Strategy

With so many methods available, how do you choose? Here's a practical decision tree:

If you just want to run a model locally (no training):

  • Use GGUF + llama.cpp with Q4_K_M. This is the most battle-tested, widely compatible option. It works on CPU and GPU alike, and the quality is excellent.

If you want GPU inference via Hugging Face:

  • Use bitsandbytes with load_in_4bit=True and NF4 for the simplest setup.
  • Use AWQ or GPTQ (via AutoGPTQ) for the best quality at 4 bits.

If you want to fine-tune a quantized model:

  • Use QLoRA (bitsandbytes 4-bit + LoRA adapters). This is the standard approach for consumer-GPU fine-tuning.

If you need to deploy to production with maximum throughput:

  • Use AWQ with vLLM or TensorRT-LLM. These inference servers can batch requests efficiently with quantized models.

If you need INT8 with minimal effort:

  • PyTorch's built-in torch.ao.quantization or bitsandbytes load_in_8bit=True. INT8 is essentially lossless for all practical purposes.
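
For the torch.ao route, dynamic INT8 quantization is essentially a one-liner. A minimal CPU sketch with a toy model (the architecture here is just an example):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Weights are converted to INT8 ahead of time; activation scales are
# computed dynamically per batch at runtime. No calibration data needed.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

torch.manual_seed(0)
x = torch.randn(1, 128)
with torch.no_grad():
    err = (model(x) - qmodel(x)).abs().max()
print(f"max output difference: {err.item():.4f}")
```

The outputs differ only by INT8 rounding noise, consistent with INT8 being practically lossless.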

A Quick Benchmark Guide

When evaluating quantization quality for your specific use case:

  1. Perplexity on a held-out set: The standard metric. Measure on WikiText-2 or your domain-specific data.
  2. Task-specific benchmarks: Perplexity doesn't capture everything. Run your actual downstream tasks (classification accuracy, BLEU scores, etc.).
  3. Needle-in-a-haystack: For long-context models, test whether quantization degrades the model's ability to retrieve information from long contexts.
  4. Qualitative evaluation: Generate samples and compare FP16 vs quantized. Sometimes subtle degradation (repetition, factual errors) doesn't show up in perplexity.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def measure_perplexity(model, tokenizer, text, stride=512, max_length=1024):
    """Measure perplexity on a text string with a sliding window."""
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(model.device)
    seq_len = input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        target_len = end - prev_end  # only score tokens not already counted
        input_chunk = input_ids[:, begin:end]
        target_chunk = input_chunk.clone()
        target_chunk[:, :-target_len] = -100  # mask the overlapping context

        with torch.no_grad():
            outputs = model(input_chunk, labels=target_chunk)
        nlls.append(outputs.loss * target_len)

        prev_end = end
        if end == seq_len:
            break

    return torch.exp(torch.stack(nlls).sum() / prev_end).item()

# Compare FP16 vs INT4
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
model_int4 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", load_in_4bit=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

test_text = open("wikitext-test.txt").read()
ppl_fp16 = measure_perplexity(model_fp16, tokenizer, test_text)
ppl_int4 = measure_perplexity(model_int4, tokenizer, test_text)

print(f"FP16 perplexity: {ppl_fp16:.2f}")
print(f"INT4 perplexity: {ppl_int4:.2f}")
print(f"Degradation: {((ppl_int4 - ppl_fp16) / ppl_fp16 * 100):.2f}%")
```

The Big Picture

Quantization is not a hack or a compromise -- it's a fundamental insight about the nature of neural networks. These models are massively over-parameterized relative to the precision they need. A weight of 0.0347 and a weight of 0.0352 produce effectively identical outputs. We can round aggressively without losing the signal.

The mathematical foundation is solid. For a well-trained model, the loss landscape around the optimal weights is flat in most directions (the Hessian has many near-zero eigenvalues). This flatness means you can perturb the weights substantially -- as quantization does -- without climbing out of the loss basin.

\Delta \mathcal{L} \approx \frac{1}{2} \delta W^T H \delta W

When H has small eigenvalues (as it does for over-parameterized models), even large perturbations \delta W produce small loss changes \Delta \mathcal{L}.
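
A toy numerical check of this argument, using a synthetic Hessian with an assumed spectrum (a few large eigenvalues, many near zero):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 100-d Hessian: 2 sharp directions, 98 nearly flat ones
eigvals = np.concatenate([np.array([10.0, 5.0]), np.full(98, 1e-4)])
Q, _ = np.linalg.qr(rng.standard_normal((100, 100)))  # random eigenbasis
H = Q @ np.diag(eigvals) @ Q.T

def loss_change(dW):
    """Second-order estimate: delta_L ~= 0.5 * dW^T H dW."""
    return 0.5 * dW @ H @ dW

flat = Q[:, -1]    # eigenvector with eigenvalue 1e-4
sharp = Q[:, 0]    # eigenvector with eigenvalue 10
print(loss_change(0.5 * flat))   # a sizable step along a flat direction: tiny loss change
print(loss_change(0.5 * sharp))  # the same-sized step along a sharp direction: ~10^5x worse
```

Quantization noise is spread across all directions at once, so most of its energy lands in the flat subspace -- which is why aggressive rounding moves the loss so little.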

The practical impact is enormous:

  • LLaMA-2 70B goes from requiring 4x A100 GPUs (140 GB) to running on a single RTX 4090 (24 GB) with INT4 quantization
  • Inference cost drops by 3-4x from reduced memory bandwidth
  • Fine-tuning on consumer hardware becomes possible via QLoRA
  • Edge deployment (phones, laptops) becomes feasible
  • Mixtral 8x7B (47B total parameters) runs on a single 24GB GPU at Q4, something that was unthinkable two years ago

The field continues to advance rapidly. We're approaching the point where quantization is less "optional optimization" and more "default deployment practice." If you're deploying any model with more than a billion parameters, you should be quantizing it.

The 4-bit sweet spot -- negligible quality loss, 4x memory savings, 2-3x speed boost -- is one of the best free lunches in all of deep learning. And unlike most "free lunches," this one has solid theoretical and empirical foundations to explain why it works.