Part 5 of 8
The Gradient Descent through Transformers
Attention Part 2 — The Sharing Lineage: From Multi-Query to Multi-Latent
This is Part 5 of The Gradient Descent through Transformers — a series where I walk through every component of the modern transformer stack, how it evolved from 2017 to 2026, and why each piece matters.
Previously: Attention Part 1 — The Mechanism That Changed Everything
In Part 1, we built self-attention from scratch and saw how multi-head attention lets a model learn diverse relationship patterns in parallel. We ended with a problem: the KV cache grows linearly with sequence length and number of heads, and the attention matrix grows quadratically with sequence length. At GPT-3 scale, that's 4.8 billion cached values per forward pass.
This post traces how the field solved that problem — not by changing the hardware, but by redesigning the attention mechanism itself. We'll follow two evolutionary threads:
- The head-sharing lineage: MQA (2019) → GQA (2023) → CLA (2024) → MLA (2024) — reducing memory by sharing or compressing KV across heads and layers
- The sparsity lineage: Sparse Transformer (2019) → Sliding Window (2020/2023) → Global+Local Hybrid (2025) — reducing compute by limiting which token pairs interact
These threads eventually converge in 2024-2025 with mechanisms like Native Sparse Attention that combine both ideas. Let's trace the full history.
The Two Costs, Quantified
In Part 1, we explained why the KV cache exists and why it's expensive — and we saw how the quadratic score matrix creates a training-time wall. Here, let's quantify both costs precisely so we can measure the improvements.
Cost 1: KV Cache During Inference
The KV cache size per layer is:
2 × seq_len × n_heads × d_head values
where seq_len is the sequence length, n_heads is the number of KV heads, and d_head is the per-head dimension. For a 32-head model with d_head = 128 serving a 32K-token sequence:
2 × 32,768 × 32 × 128 ≈ 268 million values per layer
In fp16, that's 512 MB per layer. With 80 layers (the depth of a Llama-3-70B-class model), that's 40 GB of KV cache alone — just for a single sequence. Serve 8 sequences in parallel and you need 320 GB just for caches, before any model weights.
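To sanity-check these numbers, here's a minimal sketch in plain Python (assuming fp16, i.e. 2 bytes per cached value, and d_head = 128):

```python
def kv_cache_bytes(seq_len, n_kv_heads, d_head, n_layers, bytes_per_value=2):
    """Total KV cache: 2 (K and V) x seq_len x kv_heads x d_head x layers, fp16 by default."""
    return 2 * seq_len * n_kv_heads * d_head * bytes_per_value * n_layers

# Standard 32-head MHA, d_head = 128, 32K-token sequence, 80 layers
per_sequence = kv_cache_bytes(seq_len=32_768, n_kv_heads=32, d_head=128, n_layers=80)
print(per_sequence / 2**30)       # 40.0 GB for a single sequence
print(8 * per_sequence / 2**30)   # 320.0 GB for a batch of 8
```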
Cost 2: Quadratic Score Matrix During Training
The attention score matrix is n × n, where n is the sequence length. At 128K tokens, that's roughly 16.4 billion entries per head per layer. This dominates during training/prefill (when you process the full sequence at once), though during generation each step only adds one row.
Which Matters More?
At inference time (serving): KV cache dominates. This is what you pay for 24/7.
At training time: The quadratic score matrix dominates. But you train once and serve forever.
The architectural innovations below address both, but KV cache reduction has been the higher-impact thread for production deployments.
Thread 1: The Head-Sharing Lineage
Multi-Query Attention (Shazeer, 2019)
The first big insight was embarrassingly simple. Noam Shazeer (one of the original Transformer paper authors) asked: what if all query heads share the same K and V?
In standard multi-head attention, each head has its own projections:
MQA keeps separate query projections but uses a single shared K and V:
Every query head computes attention against the same keys and values.
Step through the full MQA transformation below. Watch how K and V go from 4 separate projections to a single shared one, while Q stays untouched:
interactive
Multi-Query Attention — Step by Step
The walkthrough uses a toy configuration: d_model = 8, 4 heads, d_k = 2. In standard multi-head attention, each of the 4 heads has its own separate W_Q, W_K, and W_V projection, so the baseline costs are:
- KV cache per token: 4 K heads + 4 V heads = 8 vectors of size 2 = 16 values/token
- KV parameters: 4×(8×2) + 4×(8×2) = 128 parameters
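For readers who prefer code to diagrams, here's a minimal PyTorch sketch of the MQA layout using those same toy dimensions. The tensor names and shapes are illustrative, not taken from any production implementation:

```python
import torch
import torch.nn.functional as F

# Toy dimensions from the walkthrough above: d_model = 8, 4 query heads, d_k = 2.
d_model, n_heads, d_k, seq_len = 8, 4, 2, 5
x = torch.randn(1, seq_len, d_model)        # (batch, seq, d_model)

w_q = torch.randn(d_model, n_heads * d_k)   # 4 separate query projections
w_k = torch.randn(d_model, d_k)             # ONE shared key projection
w_v = torch.randn(d_model, d_k)             # ONE shared value projection

q = (x @ w_q).view(1, seq_len, n_heads, d_k).transpose(1, 2)  # (1, heads, seq, d_k)
k = (x @ w_k).unsqueeze(1)                  # (1, 1, seq, d_k), broadcast across heads
v = (x @ w_v).unsqueeze(1)                  # (1, 1, seq, d_k), broadcast across heads

scores = q @ k.transpose(-1, -2) / d_k**0.5
out = F.softmax(scores, dim=-1) @ v         # (1, heads, seq, d_k)
# Cache per token: 1 K + 1 V of size d_k = 4 values, vs 16 for full MHA.
```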
The Memory Savings
With 32 query heads and standard multi-head attention, you cache 32 K matrices and 32 V matrices. With MQA, you cache 1 K and 1 V — a 32x reduction in KV cache size.
For our earlier example (32K sequence, 80 layers):
- Standard MHA: 40 GB KV cache
- MQA: 1.25 GB KV cache
That's the difference between needing a cluster and fitting on a single GPU.
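You can reproduce the MQA figure with the same arithmetic as before, just with a single shared KV head:

```python
# Same sizing as the MHA example (32K tokens, d_head = 128, 80 layers, fp16),
# but caching 1 shared K/V head instead of 32:
print(2 * 32_768 * 1 * 128 * 2 * 80 / 2**30)  # ~1.25 GB per sequence
```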
The Quality Cost
MQA doesn't come for free. Sharing KV across all heads means every head sees the same keys and values — they can only differ in what they query for. This limits the diversity of attention patterns the model can learn. In practice:
- Decoder-only models tolerate MQA well — most of the "diversity" comes from the query projections anyway
- Smaller models suffer more quality degradation than larger ones
- The quality gap is small but measurable on benchmarks
Who Used MQA
- PaLM (Google, 2022) — 540B parameter model, used MQA throughout
- Falcon (TII, 2023) — 7B/40B/180B models
- StarCoder (BigCode, 2023) — code generation model
MQA proved the concept, but the quality trade-off motivated a middle ground.
Grouped-Query Attention (Ainslie et al., 2023)
GQA asks: what if instead of sharing K/V across ALL heads (MQA) or giving every head its own K/V (MHA), we group query heads and share K/V within each group?
With h query heads divided into g groups:
- Each group of h/g query heads shares one K head and one V head
- When g = 1: this is MQA (one shared KV for all)
- When g = h: this is standard MHA (every head has its own KV)
- In between: g = 4 or g = 8 gives significant savings with minimal quality loss
Explore the spectrum below — see how query heads get grouped, how K/V is shared within groups, and how the memory savings scale with different group counts:
interactive
Grouped-Query Attention — Step by Step
GQA sits on a spectrum between MHA (every head has its own KV, g = h) and MQA (all heads share one KV, g = 1). The parameter g (the number of KV groups) controls where you land.
The Math
For h = 32 query heads with g = 8 groups:
- Each group has 4 query heads sharing 1 K and 1 V head
- KV cache: 8 KV head pairs per token (4× smaller than MHA, 8× larger than MQA)
- Quality: nearly indistinguishable from MHA on most benchmarks
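As a concrete sketch, here's what the grouped attention step looks like in PyTorch. The dimensions match the running example (32 query heads, 8 KV groups); the code is illustrative, not any particular model's implementation:

```python
import torch
import torch.nn.functional as F

# h = 32 query heads, g = 8 KV groups -> 4 query heads per group.
batch, seq, d_head = 1, 16, 128
n_q_heads, n_kv_heads = 32, 8

q = torch.randn(batch, n_q_heads, seq, d_head)
k = torch.randn(batch, n_kv_heads, seq, d_head)   # only 8 K heads are cached
v = torch.randn(batch, n_kv_heads, seq, d_head)   # only 8 V heads are cached

# At compute time, replicate each cached KV head across its group of 4 query heads.
group_size = n_q_heads // n_kv_heads
k_exp = k.repeat_interleave(group_size, dim=1)    # (1, 32, seq, d_head)
v_exp = v.repeat_interleave(group_size, dim=1)

scores = q @ k_exp.transpose(-1, -2) / d_head**0.5
out = F.softmax(scores, dim=-1) @ v_exp           # same output shape as full MHA
```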
Why GQA Won
The paper showed something remarkable: you can take a model trained with MHA and uptrain it to use GQA with relatively little compute (5% of original training). This meant existing expensive models could be converted without full retraining.
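The conversion itself amounts to mean-pooling the original per-head K and V projection weights within each group, then uptraining. A rough sketch of that idea, assuming a hypothetical (heads, d_model, d_head) weight layout rather than any real checkpoint format:

```python
import torch

# Hypothetical MHA checkpoint: per-head K projection weights.
h, g, d_model, d_head = 32, 8, 4096, 128
w_k_mha = torch.randn(h, d_model, d_head)

# Mean-pool each group of h/g = 4 K heads into a single GQA K head
# (same treatment for V), then uptrain briefly to recover quality.
w_k_gqa = w_k_mha.view(g, h // g, d_model, d_head).mean(dim=1)
print(w_k_gqa.shape)  # torch.Size([8, 4096, 128])
```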
More importantly, the quality-memory trade-off at g = 8 turned out to be nearly Pareto-optimal:
- MHA: full quality, full memory cost
- GQA (g = 8): 99.5% quality, 25% memory cost
- MQA: 98-99% quality, 3% memory cost
That 0.5% quality difference matters less than the 4x memory savings for most deployments.
Who Uses GQA
Nearly everyone, as of 2024-2025:
- Llama 2 70B — the first major adoption (8 KV heads for 64 query heads)
- Llama 3 all sizes — 8B, 70B, and 405B all use GQA
- Mistral 7B / Small / Medium — 8 KV groups
- Gemma 2 and Gemma 3 — all sizes
- Qwen 2.5 / Qwen 3 — all sizes
- Command-R / R+ (Cohere)
GQA is the de facto standard. If you're building a new LLM in 2025 and don't have a strong reason to do otherwise, you use GQA.
Cross-Layer Attention (Brandon et al., MIT CSAIL, 2024)
GQA shares KV across heads within a layer. Cross-Layer Attention takes the same idea in a different direction: share KV across layers.
The observation: in deep transformers, adjacent layers often compute very similar K and V representations. If layer 12's keys are nearly identical to layer 11's, why compute and cache them separately?
The Mechanism
CLA defines a sharing factor that determines how many adjacent layers reuse the same KV cache:
- CLA2 (sharing factor 2): layers 2, 4, 6, ... reuse KV from layers 1, 3, 5, ... → 2× cache reduction
- CLA3 (sharing factor 3): every 3rd layer computes fresh KV, the other 2 reuse it → 3× cache reduction
- CLA4 (sharing factor 4): every 4th layer computes fresh KV → 4× cache reduction
The implementation is trivial — during forward pass, a layer either computes its own K, V projections normally, or simply indexes into the KV cache of a designated earlier layer. No new parameters, no architectural changes to the attention heads themselves.
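A toy sketch of that indexing rule (not the paper's code; the weights, dimensions, and cache layout here are made up for illustration):

```python
import torch

def cla_kv(x, w_k, w_v, sharing_factor=2):
    """Sketch of CLA: only every `sharing_factor`-th layer computes fresh K, V;
    the layers in between reuse the most recent producer layer's cache."""
    kv_cache = {}
    for layer_idx in range(len(w_k)):
        producer = (layer_idx // sharing_factor) * sharing_factor
        if layer_idx == producer:
            kv_cache[producer] = (x @ w_k[layer_idx], x @ w_v[layer_idx])
        k, v = kv_cache[producer]   # consumer layers simply index an earlier layer's KV
        # ... standard attention against (k, v) would go here ...
    return kv_cache

d_model, d_head, n_layers = 64, 16, 8
x = torch.randn(4, d_model)
w_k = [torch.randn(d_model, d_head) for _ in range(n_layers)]
w_v = [torch.randn(d_model, d_head) for _ in range(n_layers)]
print(len(cla_kv(x, w_k, w_v)))  # 4 cached KV sets for 8 layers -> 2x reduction
```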
Why This Fits the Sharing Lineage
Think of it as extending the "sharing" philosophy across a new axis:
- MQA: share KV across all heads (within a layer)
- GQA: share KV across groups of heads (within a layer)
- CLA: share KV across adjacent layers (orthogonal to head sharing)
The key property: CLA is orthogonal to head-sharing methods. You can combine GQA + CLA for multiplicative savings: 8× from GQA (4 groups → 8 heads sharing 1 KV set) × 2× from CLA2 = 16× total reduction over MHA.
Experimental Results
The paper tests CLA at 1B and 3B parameter scales, training on 100B tokens of RedPajama. The headline finding:
CLA2 is near-lossless. At 1B parameters, CLA2 adds only +0.04 perplexity compared to the non-sharing baseline — statistically negligible for a 2× cache reduction. At 3B parameters, CLA2 actually outperforms the baseline (lower perplexity), likely because the shared KV acts as a form of regularization at larger scales.
Higher sharing factors degrade more noticeably:
- CLA3 at 1B: +0.20 perplexity (still usable)
- CLA4 at 1B: +0.58 perplexity (significant degradation)
The sweet spot is clear: CLA2 gives you a free 2× reduction.
Composing with MQA and GQA
The paper's most practical finding: MQA + CLA2 is the recommended configuration. It achieves Pareto-optimal results — better quality-per-memory than either method alone. The combination makes intuitive sense:
- MQA aggressively compresses within each layer (1 KV head)
- CLA2 halves the number of layers that need their own cache
At 1B parameters, MQA + CLA2 achieves comparable perplexity to standalone GQA while using less total KV cache memory.
Limitations and Current Status
CLA has not been adopted in any major production model as of mid-2025. Several factors explain this:
- Scale uncertainty: all experiments are at 1-3B parameters. Whether CLA2 remains lossless at 70B+ is unverified.
- Training considerations: the paper notes that higher learning rates improve CLA performance, suggesting the optimization landscape changes — this adds tuning complexity.
- Pipeline parallelism constraint: if layers sharing KV are split across different devices, the shared cache requires cross-device communication, partially negating the memory savings.
- No serving benchmarks: the paper reports perplexity but no wall-clock latency or throughput measurements, which matter most in production.
- Competition from MLA: DeepSeek's MLA (discussed next) achieves even larger compression ratios with proven quality at scale.
Despite this, CLA's simplicity makes it a compelling option for resource-constrained settings. It requires zero new parameters, composes with any head-sharing method, and the implementation is a one-line change in the forward pass. If you're training a model under 10B parameters and want to push KV cache savings beyond GQA alone, CLA2 is the lowest-risk addition.
Multi-Latent Attention (DeepSeek-V2, 2024)
Every technique we've seen so far — MQA, GQA, CLA — works by sharing. You have one set of keys and values, and multiple consumers look at the same thing. It works, but there's an inherent tension that we should confront directly.
Remember why multi-head attention exists in the first place? The whole point of having multiple heads was to let the model learn diverse representations — one head tracks syntax, another tracks coreference, another tracks semantic similarity. Multiple heads = multiple perspectives on the same data. That's the fundamental value proposition of MHA.
And then MQA/GQA come along and say: "let's make all those heads look at the same keys and values." We added heads for diversity, and then we removed the very thing that makes them diverse. It's like hiring specialists and then giving them all the same briefing document — they can ask different questions, but they're all limited to the same information source.
What if we could approach the problem from a completely different angle?
The Motivation: Sharing vs. Compression
Consider what GQA actually sacrifices concretely. When 4 query heads share one K,V pair, those heads must all attend to the same information. Head 1 might want to track syntactic structure while head 2 wants semantic similarity — too bad, they're both limited to the same keys. The quality loss from GQA isn't random noise; it's a systematic reduction in the model's ability to represent multiple types of relevance simultaneously.
Now here's the key insight from DeepSeek's team: the K and V vectors we're storing are massively redundant. In a model with d_model=5120, each token produces a 5120-dimensional K vector. But the actual information content of that vector — the bits that actually matter for attention computation — lives on a much lower-dimensional manifold. Think of it like a 4K image of a blue sky: technically millions of pixels, but almost all of it can be described by "blue, with a slight gradient."
This is the same principle behind PCA, autoencoders, JPEG, and LoRA. High-dimensional representations are compressible because real data has structure.
The MLA proposal: instead of storing fewer KV sets (sharing), store smaller KV representations (compression). Each head still gets its own unique K and V, but those are reconstructed on-the-fly from a tiny compressed vector.
How MLA Works
Instead of caching the full K and V vectors, MLA caches a compressed latent vector c_t for each token:
c_t = W_down h_t, with c_t of size d_c ≪ d_model
During attention computation, each head reconstructs its own K and V using head-specific up-projections:
k_t = W_up^K c_t,  v_t = W_up^V c_t  (a different W_up^K and W_up^V per head)
The per-token KV cache is now just the latent vector c_t, which is dramatically smaller. And because each head has its own W_up^K and W_up^V, it reconstructs different keys — full per-head diversity from a shared compressed representation.
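A minimal sketch of the compression path in PyTorch. The weight names, shapes, and scaling are illustrative (and RoPE is omitted entirely, for reasons covered under the trade-offs below); this is not DeepSeek's implementation:

```python
import torch

d_model, d_c, n_heads, d_head = 5120, 512, 32, 128   # dimensions from the text

w_down = torch.randn(d_model, d_c) * 0.02             # shared down-projection (compression)
w_up_k = torch.randn(n_heads, d_c, d_head) * 0.02     # per-head K up-projections
w_up_v = torch.randn(n_heads, d_c, d_head) * 0.02     # per-head V up-projections

h_t = torch.randn(d_model)     # hidden state for one token
c_t = h_t @ w_down             # (d_c,) -- this latent is ALL that goes in the KV cache

# At attention time, each head reconstructs its own, distinct K and V:
k_heads = torch.einsum('c,hcd->hd', c_t, w_up_k)   # (n_heads, d_head)
v_heads = torch.einsum('c,hcd->hd', c_t, w_up_v)

# Cached per token: 512 values, vs 2 * 32 * 128 = 8,192 for full MHA.
```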
The "Absorbed" Trick: Why Decompression Is Free
The obvious concern: doesn't reconstructing K for every cached token at every step add massive compute? Here's DeepSeek's clever realization. The attention score computation is:
score = q · k = q · (W_up^K c_t)
By the associativity of matrix multiplication, you can fold the decompression into the query. Define q̃ = (W_up^K)ᵀ q — this is a one-time transform of the current query vector (cheap: one vector, not the entire cache). Then compute scores as q̃ · c_t — directly between the transformed query and the compressed latents. The full K vector is never materialized. The same trick works for values.
The decompression matrix W_up^K is "absorbed" into the query, making the computation equivalent to standard attention run against a much smaller cache.
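Here's a tiny numeric check of that associativity claim for a single head, with illustrative shapes:

```python
import torch

d_c, d_head, n_cached = 512, 128, 100
w_up_k = torch.randn(d_c, d_head)      # one head's K up-projection
q = torch.randn(d_head)                # current query for that head
c = torch.randn(n_cached, d_c)         # cached latents for 100 earlier tokens

# Naive path: decompress every cached latent back to a full key, then score.
scores_naive = (c @ w_up_k) @ q                 # (n_cached,)

# Absorbed path: fold the up-projection into the query once, score against latents directly.
q_absorbed = w_up_k @ q                         # (d_c,) -- one small matvec per step
scores_absorbed = c @ q_absorbed                # (n_cached,)

print(torch.allclose(scores_naive, scores_absorbed, rtol=1e-3, atol=1e-3))  # True
```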
The Numbers
DeepSeek-V2 uses d_c = 512 with d_model = 5120. That's a 10× compression ratio. Compared to the alternatives (for 32 heads with d_head = 128):
- MHA (32 heads): stores 2 × 32 × 128 = 8,192 values per token
- GQA (8 groups): stores 2 × 8 × 128 = 2,048 values per token
- MLA: stores 512 values per token (just the latent)
4× better than GQA, 16× better than MHA, while maintaining per-head diversity that GQA sacrifices. DeepSeek-V2 matches or beats Llama 2 70B quality with 21B active parameters and a fraction of the KV cache.
The Trade-offs
MLA isn't a free lunch — it trades memory for other constraints:
- Low-rank bottleneck: The compression is lossy. If some critical information lives in dimensions that the compression discards, it's gone. You're betting that c_t captures "enough" of what the full K and V contained.
- RoPE incompatibility: The absorption trick breaks with standard RoPE. Remember, RoPE applies position-dependent rotations to Q and K after projection — but absorption requires folding W_up into Q before the dot product. If K has position information baked in, you can't absorb it cleanly. DeepSeek's solution: decouple positional encoding from the compressed latent by adding a small separate "RoPE key" that carries only positional information, while the latent carries content. This adds architectural complexity and a small amount of extra cache (the position keys).
- Implementation complexity: The absorbed formulation with decoupled RoPE requires custom CUDA kernels. You can't just swap this into a standard attention implementation.
- Parallelism challenges: GQA maps cleanly onto tensor parallelism (split by groups). MLA's latent doesn't partition as naturally across devices.
- Training cost: The model must learn good compression/decompression matrices, adding optimization complexity.
But the results speak: DeepSeek-V3 (671B MoE, December 2024) uses MLA and achieved frontier-level performance, validating the approach at massive scale. It's the most aggressive KV cache compression in any production model.
interactive
Multi-Latent Attention — From Intuition to Mechanism
The walkthrough contrasts GQA (4 heads, 2 groups, where Q₁ and Q₂ must attend to the same keys and values) with MLA. Sharing has a fundamental limit: shared heads see identical information. The question MLA asks: what if instead of storing fewer KV sets (sharing), we store smaller KV representations (compression)? Each head gets unique information, stored in a compact form.
Want to go deeper? MLA has more subtleties — the decoupled RoPE design, the joint compression of K and V, and how it interacts with MoE routing. We rebuild the entire architecture from scratch in: Building DeepSeek-V3 from Ground Up →
What Models Actually Use
| Model | Year | KV Strategy | Compression |
|---|---|---|---|
| GPT-3 | 2020 | Full MHA | None |
| PaLM | 2022 | MQA | 32× via full sharing |
| Falcon 40B | 2023 | MQA | 32× via full sharing |
| Llama 2 70B | 2023 | GQA (8 KV heads) | 4× via grouped sharing |
| Llama 3 / 3.1 | 2024 | GQA (8 KV heads) | 4× via grouped sharing |
| DeepSeek-V2 | 2024 | MLA (d_c=512) | 16× via latent compression |
| Phi-4-mini | 2024 | GQA | 4× via grouped sharing |
| Qwen 3 | 2025 | GQA | 4× via grouped sharing |
| DeepSeek-V3 / R1 | 2024-25 | MLA | 16× via latent compression |
| GLM-5 / 5.1 | 2025 | MLA-256 | ~93% cache reduction |
| Mistral Large 3 | 2025 | MLA-style | Latent compression + MoE |
| Zyphra Zaya1-8B | 2025 | CCA (latent space) | 8× via latent compression |
| Gemma 4 | 2025 | GQA + Shared KV (CLA-like) | Cross-layer reuse in later layers |
| Llama 4 | 2025 | GQA | 4× via grouped sharing |
The Pattern
Three tiers have emerged:
- Standard choice (most models): GQA with 8 groups. Simple, well-understood, good enough. If you're building a new LLM and don't have a strong reason to do otherwise, this is the default.
- Latent compression (growing fast): MLA and variants (GLM-5, Mistral Large 3, Zyphra). Best cache compression with full per-head diversity. DeepSeek proved it at scale, and others are now following.
- Cross-layer sharing (emerging): Gemma 4 reuses KV states from earlier layers — the CLA idea we discussed, now validated in a production model.
Beyond Gemma 4's cross-layer reuse, CLA remains unproven at the largest scales, but it may appear in more models as a low-risk addition on top of GQA.
The Trade-off Landscape
There's no free lunch. Every approach in the sharing lineage sacrifices something:
MQA: Maximum memory savings, but all heads see identical KV — limits pattern diversity. Small quality loss that compounds in reasoning tasks.
GQA: Balanced trade-off. Minimal quality loss, significant memory savings. The "safe default" that's hard to beat for general-purpose models.
CLA: Free 2× reduction with near-zero quality loss at tested scales (≤3B). Unproven beyond that. Composes well with everything else.
MLA: Best compression ratio with full per-head diversity, but complex to implement correctly, requires decoupled RoPE, custom CUDA kernels, and careful training. The reward is 4× better than GQA — if you can pay the engineering cost.
What's Coming Next: Attacking the Quadratic Wall
Everything in this post reduces the memory cost of attention — making the KV cache smaller so you can serve longer sequences and bigger batches. But there's a second bottleneck we haven't touched: compute.
Even with MLA's compressed cache, the attention computation itself is still O(n²) in sequence length. Every token still attends to every other token (or at least, every cached latent). At 128K context, that's roughly 16 billion attention score computations per head per layer.
The next generation of innovations attacks this directly: what if most tokens don't need to attend to most other tokens? Sparse attention patterns — sliding windows, global/local hybrids, learned routing — can reduce the quadratic cost to near-linear, without fundamentally changing the attention mechanism.
That's the subject of Attention Part 3: The Sparsity Lineage — from Sparse Transformer's fixed patterns (2019) to Mistral's sliding window, Gemma 3's hybrid layers, and DeepSeek's learned Native Sparse Attention.
References
- Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019) — The MQA paper. Short, elegant, and foundational.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023) — The GQA paper, including the uptraining approach.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (DeepSeek, 2024) — Introduces Multi-Latent Attention.
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (Brandon et al., 2024) — Cross-Layer Attention from MIT CSAIL.
Next in the series: Attention Part 3 — The Sparsity Lineage — sliding windows, global/local hybrids, differential attention, and learned sparse patterns. How the field made attention sub-quadratic without losing what matters.
This post is part of The Gradient Descent through Transformers — a series dissecting every component of the modern transformer stack.