Native Sparse Attention: Hardware-Aligned Learned Sparsity
Part of the Paper × Code series — where we dissect papers and rebuild them from scratch.
Coming Soon. This post is currently being written. Check back soon for the full paper dissection with code implementations.
What We're Building
Paper: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Authors: DeepSeek-AI (February 2025)
Link: arxiv.org/abs/2502.11089
The mechanism at a glance:
- Three parallel attention paths: a compressed global view, selected important tokens, and a local sliding window
- Learned routing that decides which tokens matter per query
- Hardware-aligned design: all paths maintain dense memory access patterns
- End-to-end trainable with no auxiliary losses
- ~15× reduction in attention scores computed at 128K context
We'll build each component, understand the design decisions, and see how they compose into a mechanism that's now the blueprint for frontier sparse attention.
Part 1: The Three Paths
Compressed Global Context
TODO: The compression module. Pooling consecutive tokens into summaries. Why dense compression (not sparse selection) for the global view. Dimensions and compute analysis.
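As a placeholder until the full section lands, here is a minimal NumPy sketch of the compression idea: consecutive key/value tokens are pooled into one summary per block. The paper uses a learned pooling module; plain mean pooling stands in for it here, and all names and sizes are illustrative.

```python
import numpy as np

def compress_blocks(kv, block_size):
    """Pool consecutive tokens into one summary token per block.

    kv: (seq_len, d) array of keys (or values); seq_len is assumed
    to be a multiple of block_size for simplicity.
    """
    seq_len, d = kv.shape
    n_blocks = seq_len // block_size
    # (n_blocks, block_size, d) -> mean over the within-block dimension
    return kv.reshape(n_blocks, block_size, d).mean(axis=1)

keys = np.random.randn(1024, 64)            # 1024 tokens, head dim 64
compressed = compress_blocks(keys, block_size=32)
print(compressed.shape)                      # (32, 64): 32 summaries
```

The compute payoff is immediate: a query attends over 32 summaries instead of 1024 raw keys on this toy configuration, at the cost of losing token-level detail — which is exactly what the selection path restores.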
Learned Token Selection (The Router)
TODO: How the router scores tokens for relevance. Top-k selection. Why content-aware sparsity beats fixed patterns. The straight-through estimator for gradients.
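Ahead of the full write-up, a NumPy sketch of the selection step: score each key block against the current query and keep the top-k blocks. The scoring rule here (a plain dot product against block summaries) is an illustrative stand-in, and at training time the hard top-k needs a differentiable relaxation such as the straight-through estimator, which the full section will cover.

```python
import numpy as np

def select_blocks(query, block_summaries, k):
    """Return indices of the k blocks most relevant to `query`.

    query: (d,) vector; block_summaries: (n_blocks, d), e.g. from the
    compression path. Scores are plain dot products in this sketch.
    """
    scores = block_summaries @ query                 # (n_blocks,)
    topk = np.argpartition(scores, -k)[-k:]          # unordered top-k
    return np.sort(topk)                             # restore block order

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
summaries = rng.standard_normal((32, 64))
print(select_blocks(q, summaries, k=4))              # 4 block indices
```

Because the scores depend on the query's content, the sparsity pattern adapts per token — the core advantage over fixed strided or local patterns.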
Sliding Window (The Reliable Local Path)
TODO: Why locality is always included. How it interacts with the other two paths. The learned gate that weights all three contributions.
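A minimal sketch of how the three contributions combine: each path runs its own attention over its own key/value set, and a per-query gate weights the results. In the paper the gate values come from a learned projection of the query; here they are fixed placeholder scalars, and all shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Standard scaled dot-product attention for a single query."""
    d = q.shape[-1]
    return softmax(K @ q / np.sqrt(d)) @ V

def three_path_output(q, paths, gates):
    """Gated sum of per-path attention outputs.

    paths: dict of name -> (K, V) for the compressed, selected, and
    sliding-window key/value sets; gates: dict of name -> scalar weight.
    """
    out = np.zeros_like(q)
    for name, (K, V) in paths.items():
        out += gates[name] * attend(q, K, V)
    return out

rng = np.random.default_rng(1)
q = rng.standard_normal(64)
paths = {
    "compressed": (rng.standard_normal((32, 64)), rng.standard_normal((32, 64))),
    "selected":   (rng.standard_normal((128, 64)), rng.standard_normal((128, 64))),
    "window":     (rng.standard_normal((256, 64)), rng.standard_normal((256, 64))),
}
gates = {"compressed": 0.3, "selected": 0.5, "window": 0.2}
print(three_path_output(q, paths, gates).shape)   # (64,)
```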
Code: Three-Path Attention
TODO: Full PyTorch implementation of the combined mechanism.
Part 2: Hardware-Aligned Design
Why Previous Sparse Methods Were Slow
TODO: The scattered memory access problem. Why saving FLOPs ≠ saving wall-clock time. GPU memory hierarchy (HBM → L2 → SRAM).
NSA's Solution: Dense Buffers
TODO: How gather-into-contiguous-buffer before attention makes each path a standard dense matmul. Why this is critical for achieving actual speedup.
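The gather step can be sketched in a few lines: copy the selected key blocks into one contiguous buffer, so the subsequent score computation is a single dense matmul over that buffer instead of scattered reads across the full cache. This NumPy version only illustrates the memory-layout idea; the real win comes from doing the equivalent in a fused GPU kernel.

```python
import numpy as np

def gather_selected(K_cache, block_ids, block_size):
    """Copy the selected blocks of the key cache into a dense buffer.

    K_cache: (seq_len, d); block_ids: sorted indices of selected blocks.
    Returns a (len(block_ids) * block_size, d) contiguous array.
    """
    rows = [K_cache[b * block_size:(b + 1) * block_size] for b in block_ids]
    return np.ascontiguousarray(np.concatenate(rows, axis=0))

K_cache = np.random.randn(1024, 64)
buf = gather_selected(K_cache, block_ids=[0, 7, 30], block_size=32)
print(buf.shape, buf.flags["C_CONTIGUOUS"])       # (96, 64) True
```

Once the buffer exists, `buf @ q` is an ordinary dense operation with sequential memory access — which is why the FLOP savings actually translate into wall-clock savings.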
Code: Efficient Implementation
TODO: The gather and scatter operations. Memory layout decisions.
Part 3: Training Dynamics
End-to-End Trainability
TODO: How gradients flow through the router. The discrete selection problem and relaxations.
No Auxiliary Losses Needed
TODO: Why NSA doesn't need load-balancing losses (unlike MoE routing). Natural emergence of balanced selection.
Scaling Behavior
TODO: How NSA's efficiency gains scale with sequence length. When it breaks even vs. Flash Attention.
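A back-of-the-envelope sketch of the scaling argument: count how many key positions a query actually attends to under each path. The configuration values below (block size 32, 64 selected blocks, a 2048-token window) are illustrative assumptions, not the paper's exact hyperparameters.

```python
def attended_positions(seq_len, block_size=32, n_selected=64, window=2048):
    compressed = seq_len // block_size    # one summary per block
    selected = n_selected * block_size    # full tokens in the top-k blocks
    local = min(window, seq_len)          # sliding-window tokens
    return compressed + selected + local

for n in (4_096, 32_768, 131_072):
    nsa = attended_positions(n)
    print(f"{n:>7} tokens: full={n:>7}, nsa={nsa:>5}, ratio={n / nsa:.1f}x")
```

Under these toy numbers the selected and window terms are fixed costs, so short sequences sit below break-even (at 4K context, NSA attends to more positions than full attention does), while the ratio grows roughly linearly with sequence length beyond that point.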
Part 4: NSA in Context
The Lineage
TODO: How NSA synthesizes ideas from Sparse Transformer (factorized patterns), sliding window (reliable locality), MLA (compression), and MoE (learned routing).
Who's Using NSA-Style Attention
TODO: DeepSeek-V4's CSA/HCA, GLM-5's Lightning Indexer, Qwen 3 Next's hybrid design — all descendants of the NSA blueprint.
The Results
TODO: Benchmark comparisons, throughput numbers, quality vs. full attention.
Key Takeaways
TODO: When to use NSA-style attention. What's practical at smaller scales. The minimum sequence length where it pays off.