Generating a single 1,000-token response with a naive transformer requires roughly 500,000 matrix multiplications that produce output you immediately throw away. That number is not a rounding error - it is the actual cost of recomputing the same vectors over and over, for tokens that have not changed, because nobody told the model it was allowed to remember them.
Parallel processing is only possible during training, when you already have the entire document. During inference, the model generates one token at a time - it cannot know token 600 until it has committed to tokens 1 through 599. This is called autoregressive generation, and it creates a mismatch between how transformers are designed and how they are deployed.
The Autoregressive Loop
Start with a concrete example. The model receives "The cat sat" and must continue.
Step 1: Feed ["The", "cat", "sat"] into the model. Attention is computed for all 3 tokens. Output: "on".
Step 2: Feed ["The", "cat", "sat", "on"] - all 4 tokens. Attention is recomputed for every token. Output: "the".
Step 3: Feed ["The", "cat", "sat", "on", "the"] - all 5. Every token again. Output: "mat".
At step 2, the model recalculates the Q, K, V projections for "The", "cat", and "sat" from scratch. Nothing about those tokens has changed - their embeddings are identical to step 1. At step 3, all of them get recomputed again. Token 1 will be recomputed a total of $N - 1$ times across an $N$-token generation.
The total work sums to $1 + 2 + \dots + N = \frac{N(N+1)}{2}$, which is $O(N^2)$. Double the generation length, quadruple the compute. This is why, without caching, LLM inference at 4K+ contexts was essentially unusable.
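That sum is where the opening figure comes from - for the 1,000-token example, the redundant recomputations alone number:

$$\sum_{t=1}^{1000} (t - 1) = \frac{999 \times 1000}{2} = 499{,}500 \approx 5 \times 10^5$$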
Keys and Values Do Not Change
To see why caching is possible, look at the attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
During generation, the only thing the model needs to produce is the attention output for the newest token. What does that token depend on?
- Its own Query ($q_t$): computed fresh because it is a new token.
- All Keys ($K$): the newest token compares its query against the keys of every previous token. But $k_i = x_i W_K$ depends only on the embedding $x_i$, which is fixed once token $i$ has been generated.
- All Values ($V$): same argument. $v_i = x_i W_V$ does not change after token $i$ exists.
The word "cat" produces the same K and V vector whether we are at step 3 or step 900. Every step, we recompute them from scratch and use them once. The fix is obvious in retrospect: compute them once, store the result, and look it up. This is the KV Cache.
How KV Caching Works
Prefill phase
When the user sends a prompt, the model processes all prompt tokens in parallel - just like training. This prefill step computes Q, K, V for every prompt token simultaneously and populates the initial cache. The quadratic cost of attention still applies here, but it happens only once and scales with prompt length, not generation length.
Decode phase
For each new token generated:
- Compute only $q_t$ - the query for the new token (a single vector, not a matrix)
- Compute $k_t$ and $v_t$ - the key and value for the new token
- Append $k_t$ and $v_t$ to the cached $K$ and $V$
- Compute attention: $\text{softmax}\!\left(\frac{q_t K^\top}{\sqrt{d_k}}\right)V$
The Q matrix has shrunk from $N \times d$ to $1 \times d$. We eliminated re-projection of all past tokens. Per-step projection cost drops from $O(N)$ to $O(1)$, and the total cost across all steps from $O(N^2)$ to $O(N)$.
Implementation
The naive approach recomputes everything. Here is a minimal single-head sketch (no batching or multi-head logic; `W_q`, `W_k`, `W_v` are illustrative projection matrices, not a real library API):
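```python
import torch

def naive_attention_step(tokens, W_q, W_k, W_v):
    # tokens: (seq_len, d_model) embeddings of the ENTIRE sequence so far.
    Q = tokens @ W_q   # (seq_len, d_head) -- every row recomputed, every step
    K = tokens @ W_k   # identical to last step's K plus one new row
    V = tokens @ W_v   # same redundancy
    scores = Q @ K.T / K.shape[-1] ** 0.5
    # Causal mask: token i may only attend to tokens <= i.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V   # (seq_len, d_head)
```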
With KV caching, we only process the new token. A sketch under the same assumptions:
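```python
import torch

def cached_attention_step(new_token, W_q, W_k, W_v, K_cache, V_cache):
    # new_token: (1, d_model) embedding of ONLY the newest token.
    q = new_token @ W_q                       # one projection, not seq_len of them
    k = new_token @ W_k
    v = new_token @ W_v
    K_cache = torch.cat([K_cache, k], dim=0)  # cache grows by one row
    V_cache = torch.cat([V_cache, v], dim=0)
    scores = q @ K_cache.T / K_cache.shape[-1] ** 0.5
    # No mask needed: the newest token is allowed to see everything before it.
    out = torch.softmax(scores, dim=-1) @ V_cache   # (1, d_head)
    return out, K_cache, V_cache
```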
The generation loop becomes a prefill pass followed by one-token decode steps. This sketch assumes a hypothetical `model` callable that accepts and returns a per-layer cache:
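```python
def generate(model, prompt_ids, max_new_tokens):
    # Assumed contract (hypothetical): model(ids, kv_cache) returns
    # (logits for the fed positions, updated list of (K, V) pairs per layer).

    # Prefill: the whole prompt in one parallel pass, building the cache.
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    next_id = int(logits[-1].argmax())

    out = [next_id]
    for _ in range(max_new_tokens - 1):
        # Decode: feed only the newest token; everything older lives in the cache.
        logits, kv_cache = model([next_id], kv_cache=kv_cache)
        next_id = int(logits[-1].argmax())
        out.append(next_id)
    return out
```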
Notice what kv_cache is: a list of (K, V) tensor pairs, one per transformer layer. Every layer maintains its own cache because every layer has its own K/V projections. The cache grows by one row per layer per generated token.
The Memory Wall
KV caching trades a compute problem for a memory problem. The compute savings are real and dramatic. The memory cost is also real and dramatic, and at scale, it becomes the binding constraint.
How big is the cache?
The KV cache stores two tensors (K and V) for every layer, every attention head, at every position in the sequence:

$$\text{cache bytes} = 2 \times L \times H \times d_{\text{head}} \times S \times B \times \text{bytes per element}$$

Where $L$ = layers, $H$ = heads, $d_{\text{head}}$ = head dimension, $S$ = sequence length, $B$ = batch size, and the dtype is typically fp16 (2 bytes per element).
Concrete numbers: a typical 7B configuration (32 layers, 32 heads, head dimension 128, fp16) works out to 2 × 32 × 32 × 128 × 2 bytes = 512 KB of cache per token - about 2 GB for a 4K-token sequence at batch size 1.
2 GB for a 7B model seems manageable. But scale up the sequence length or batch size and things get dire fast:
At batch size 8 with a 32K context window, even a 7B model's KV cache exceeds the memory of a consumer GPU. A 70B model at 128K tokens? The cache alone needs hundreds of gigabytes - more than the model weights themselves.
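To make those numbers checkable, here is a small calculator built directly from the formula above (the model configurations are typical values, assumed for illustration rather than taken from any specific checkpoint):

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch, bytes_per_el=2):
    # 2 tensors (K and V) x layers x heads x head_dim x positions x batch.
    return 2 * layers * kv_heads * d_head * seq_len * batch * bytes_per_el

GB = 1024 ** 3
# 7B-class model with full multi-head attention: 32 layers, 32 heads, d_head 128.
print(kv_cache_bytes(32, 32, 128, 4096, 1) / GB)    # 2.0 GB at 4K context
print(kv_cache_bytes(32, 32, 128, 32768, 8) / GB)   # 128 GB at 32K x batch 8
# 70B-class model with full MHA: 80 layers, 64 heads, d_head 128, 128K context.
print(kv_cache_bytes(80, 64, 128, 131072, 1) / GB)  # 320 GB: more than the weights
```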
This is the memory wall. The KV cache becomes the dominant memory consumer, and it determines how many users a single GPU can serve simultaneously. A GPU that can hold the 70B weights and produce one slow response might be able to serve zero concurrent users if there is no memory left for their caches.
The compute problem was solved by storing more. The memory problem cannot be solved by storing more - it must be attacked by storing less, or storing smarter.
Attacking the Memory Wall
Multi-Query and Grouped-Query Attention
Standard multi-head attention gives each head its own K and V projections. Multi-Query Attention (MQA) (Shazeer, 2019) shares a single K/V head across all query heads. The cache shrinks by a factor of $H$ - the number of heads. The quality hit is real but often acceptable.
Grouped-Query Attention (GQA) (Ainslie et al., 2023) finds a middle ground: groups of query heads share K/V heads. Llama 2 70B uses GQA with 8 KV heads shared across 64 query heads, reducing cache size by 8x with less quality degradation than full MQA.
The cache is just the stored K and V tensors. Fewer KV heads means fewer tensors to store - the query heads still exist and still attend, they just all read from a shared set of keys and values.
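A minimal sketch of that read path (causal masking and batching omitted; for Llama 2 70B the group size below would be 64 / 8 = 8):

```python
import torch

def gqa_attention(q, k, v):
    # q:    (n_query_heads, seq_len, d_head)
    # k, v: (n_kv_heads, seq_len, d_head) -- the ONLY tensors the cache stores
    group = q.shape[0] // k.shape[0]       # query heads served per KV head
    k = k.repeat_interleave(group, dim=0)  # broadcast each KV head to its group
    v = v.repeat_interleave(group, dim=0)  # (a fused kernel indexes instead of copying)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v   # (n_query_heads, seq_len, d_head)
```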
FlashAttention
FlashAttention (Dao et al., 2022) does not eliminate the cache, but makes the attention computation itself dramatically faster by being IO-aware. Standard attention materializes the full attention score matrix in GPU high-bandwidth memory (HBM). For long sequences, that matrix dominates memory bandwidth - you spend most of your time reading and writing it, not computing.
FlashAttention tiles the computation into blocks that fit in fast on-chip SRAM, never materializing the full matrix. The result is 2-4x faster attention and significantly lower peak memory, with mathematically identical output.
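In practice you rarely write the tiled kernel yourself. PyTorch's fused attention op, for example, dispatches to a FlashAttention-style kernel when the hardware and dtype allow it; the shapes below are illustrative:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, d_head) in fp16 on GPU - the layout fused kernels expect.
q = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 4096, 128, device="cuda", dtype=torch.float16)

# The 4096 x 4096 score matrix is never materialized in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```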
PagedAttention
Traditional KV caching pre-allocates a contiguous block of memory for the maximum possible sequence length. A request that terminates early leaves most of that allocation unused. With hundreds of concurrent requests, the fragmentation is massive - you might have 40% of your GPU memory reserved but unreachable.
PagedAttention (Kwon et al., 2023), implemented in vLLM, treats KV cache like virtual memory. Cache is stored in non-contiguous pages that are allocated on demand. Pages can be shared across requests that share a common prefix (system prompts, few-shot examples). Fragmentation drops to near zero. Throughput nearly doubles on realistic serving workloads - not because the math changed, but because the allocator got smarter.
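The core idea fits in a toy allocator. A hypothetical sketch of a block table in the spirit of PagedAttention (in real vLLM each page holds K/V tensors in GPU memory; here pages are just IDs):

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical pages
        self.tables = {}    # request_id -> list of physical block ids
        self.lengths = {}   # request_id -> tokens stored so far

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % self.block_size == 0:
            # Allocate one page on demand instead of reserving max_seq_len
            # upfront: at most block_size - 1 slots are ever wasted per request.
            self.tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.lengths[request_id] = n + 1

    def free(self, request_id):
        # A finished request returns every page to the pool immediately.
        self.free_blocks.extend(self.tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```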
Quantized KV Caches
The cache is in fp16 because attention arithmetic is sensitive to precision. But how sensitive, exactly? If K and V can be quantized to int8 or int4 with acceptable quality loss, the cache shrinks by 2-4x. Research (Hooper et al., 2024) shows the two tensors fail differently: K is dominated by outlier channels whose errors get amplified through the softmax, so it benefits from per-channel quantization, while V quantizes well per-token because its errors enter only linearly through the weighted sum. Some production systems run K at int8 and V at fp16 as a pragmatic compromise.
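A minimal sketch of per-channel int8 quantization for the K cache (names are hypothetical; a real system would fuse the dequantization into the attention kernel rather than materializing fp16 keys):

```python
import torch

def quantize_k_int8(K):
    # Per-channel symmetric int8: one scale per head dimension, matching the
    # observation that K's quantization error concentrates in outlier channels.
    scale = K.abs().amax(dim=0).clamp(min=1e-6) / 127.0       # (d_head,)
    K_q = torch.round(K / scale).clamp(-127, 127).to(torch.int8)
    return K_q, scale.to(torch.float16)

def dequantize(K_q, scale):
    # Restored on the fly when attention scores are computed.
    return K_q.to(torch.float16) * scale

K = torch.randn(4096, 128)           # cached keys for one layer/head
K_q, scale = quantize_k_int8(K)      # int8 storage: ~2x smaller than fp16
err = (dequantize(K_q, scale) - K.half()).abs().max()
```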
What This Means in Practice
Every technique above is a different angle of attack on the same constraint: a fixed amount of GPU memory that must be shared between model weights, KV cache, and activations. The KV cache, unlike the weights, grows with sequence length and batch size - which means at high throughput, it crowds everything else out.
A 70B model serving 100 concurrent users at 32K context each has a KV cache footprint in the terabytes. No single GPU holds that. Production systems shard it across devices, evict cold pages to CPU memory, and batch requests that share prefixes to amortize cache cost. The inference engine is not just attention math running fast - it is a memory allocator that happens to also compute attention.
Understanding KV caching is understanding why inference infrastructure is hard. The compute was always tractable. The memory is what scales badly.
