Native Sparse Attention: Hardware-Aligned Learned Sparsity
Part of the Paper × Code series — where we dissect papers and rebuild them from scratch.
Coming Soon. This post is currently being written. Check back soon for the full paper dissection with code implementations.
What We're Building
Paper: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Authors: DeepSeek-AI (February 2025)
Link: arxiv.org/abs/2502.11089
The mechanism at a glance:
- Three parallel attention paths: a compressed global view, selected important tokens, and a local sliding window
- Learned routing that decides which tokens matter per query
- Hardware-aligned design: all paths maintain dense memory access patterns
- End-to-end trainable with no auxiliary losses
- ~15× reduction in attention scores computed at 128K context
We'll build each component, understand the design decisions, and see how they compose into a mechanism that's now the blueprint for frontier sparse attention.
Part 1: The Three Paths
Compressed Global Context
TODO: The compression module. Pooling consecutive tokens into summaries. Why dense compression (not sparse selection) for the global view. Dimensions and compute analysis.
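As a placeholder until the full section lands, here is a minimal NumPy sketch of the compression idea: consecutive key/value tokens are pooled into one summary per block. The paper uses a learned pooling module; plain mean pooling stands in for it here, and all names and sizes are illustrative.

```python
import numpy as np

def compress_blocks(kv, block_size):
    """Pool consecutive tokens into one summary token per block.

    kv: (seq_len, d) array of keys (or values); seq_len is assumed
    to be a multiple of block_size for simplicity.
    """
    seq_len, d = kv.shape
    n_blocks = seq_len // block_size
    # (n_blocks, block_size, d) -> mean over the within-block dimension
    return kv.reshape(n_blocks, block_size, d).mean(axis=1)

keys = np.random.randn(1024, 64)            # 1024 tokens, head dim 64
compressed = compress_blocks(keys, block_size=32)
print(compressed.shape)                      # (32, 64): 32 summaries
```

The compute payoff is immediate: a query attends over 32 summaries instead of 1024 raw keys on this toy configuration, at the cost of losing token-level detail — which is exactly what the selection path restores.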
Learned Token Selection (The Router)
TODO: How the router scores tokens for relevance. Top-k selection. Why content-aware sparsity beats fixed patterns. The straight-through estimator for gradients.
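Ahead of the full write-up, a NumPy sketch of the selection step: score each key block against the current query and keep the top-k blocks. The scoring rule here (a plain dot product against block summaries) is an illustrative stand-in, and at training time the hard top-k needs a differentiable relaxation such as the straight-through estimator, which the full section will cover.

```python
import numpy as np

def select_blocks(query, block_summaries, k):
    """Return indices of the k blocks most relevant to `query`.

    query: (d,) vector; block_summaries: (n_blocks, d), e.g. from the
    compression path. Scores are plain dot products in this sketch.
    """
    scores = block_summaries @ query                 # (n_blocks,)
    topk = np.argpartition(scores, -k)[-k:]          # unordered top-k
    return np.sort(topk)                             # restore block order

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
summaries = rng.standard_normal((32, 64))
print(select_blocks(q, summaries, k=4))              # 4 block indices
```

Because the scores depend on the query's content, the sparsity pattern adapts per token — the core advantage over fixed strided or local patterns.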
Sliding Window (The Reliable Local Path)
TODO: Why locality is always included. How it interacts with the other two paths. The learned gate that weights all three contributions.
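A minimal sketch of how the three contributions combine: each path runs its own attention over its own key/value set, and a per-query gate weights the results. In the paper the gate values come from a learned projection of the query; here they are fixed placeholder scalars, and all shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Standard scaled dot-product attention for a single query."""
    d = q.shape[-1]
    return softmax(K @ q / np.sqrt(d)) @ V

def three_path_output(q, paths, gates):
    """Gated sum of per-path attention outputs.

    paths: dict of name -> (K, V) for the compressed, selected, and
    sliding-window key/value sets; gates: dict of name -> scalar weight.
    """
    out = np.zeros_like(q)
    for name, (K, V) in paths.items():
        out += gates[name] * attend(q, K, V)
    return out

rng = np.random.default_rng(1)
q = rng.standard_normal(64)
paths = {
    "compressed": (rng.standard_normal((32, 64)), rng.standard_normal((32, 64))),
    "selected":   (rng.standard_normal((128, 64)), rng.standard_normal((128, 64))),
    "window":     (rng.standard_normal((256, 64)), rng.standard_normal((256, 64))),
}
gates = {"compressed": 0.3, "selected": 0.5, "window": 0.2}
print(three_path_output(q, paths, gates).shape)   # (64,)
```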
Code: Three-Path Attention
TODO: Full PyTorch implementation of the combined mechanism.
Part 2: Hardware-Aligned Design
Why Previous Sparse Methods Were Slow
TODO: The scattered memory access problem. Why saving FLOPs ≠ saving wall-clock time. GPU memory hierarchy (HBM → L2 → SRAM).
NSA's Solution: Dense Buffers
TODO: How gather-into-contiguous-buffer before attention makes each path a standard dense matmul. Why this is critical for achieving actual speedup.
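The gather step can be sketched in a few lines: copy the selected key blocks into one contiguous buffer, so the subsequent score computation is a single dense matmul over that buffer instead of scattered reads across the full cache. This NumPy version only illustrates the memory-layout idea; the real win comes from doing the equivalent in a fused GPU kernel.

```python
import numpy as np

def gather_selected(K_cache, block_ids, block_size):
    """Copy the selected blocks of the key cache into a dense buffer.

    K_cache: (seq_len, d); block_ids: sorted indices of selected blocks.
    Returns a (len(block_ids) * block_size, d) contiguous array.
    """
    rows = [K_cache[b * block_size:(b + 1) * block_size] for b in block_ids]
    return np.ascontiguousarray(np.concatenate(rows, axis=0))

K_cache = np.random.randn(1024, 64)
buf = gather_selected(K_cache, block_ids=[0, 7, 30], block_size=32)
print(buf.shape, buf.flags["C_CONTIGUOUS"])       # (96, 64) True
```

Once the buffer exists, `buf @ q` is an ordinary dense operation with sequential memory access — which is why the FLOP savings actually translate into wall-clock savings.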
Code: Efficient Implementation
TODO: The gather and scatter operations. Memory layout decisions.
Part 3: Training Dynamics
End-to-End Trainability
TODO: How gradients flow through the router. The discrete selection problem and relaxations.
No Auxiliary Losses Needed
TODO: Why NSA doesn't need load-balancing losses (unlike MoE routing). Natural emergence of balanced selection.
Scaling Behavior
TODO: How NSA's efficiency gains scale with sequence length. When it breaks even vs. Flash Attention.
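A back-of-the-envelope sketch of the scaling argument: count how many key positions a query actually attends to under each path. The configuration values below (block size 32, 64 selected blocks, a 2048-token window) are illustrative assumptions, not the paper's exact hyperparameters.

```python
def attended_positions(seq_len, block_size=32, n_selected=64, window=2048):
    compressed = seq_len // block_size    # one summary per block
    selected = n_selected * block_size    # full tokens in the top-k blocks
    local = min(window, seq_len)          # sliding-window tokens
    return compressed + selected + local

for n in (4_096, 32_768, 131_072):
    nsa = attended_positions(n)
    print(f"{n:>7} tokens: full={n:>7}, nsa={nsa:>5}, ratio={n / nsa:.1f}x")
```

Under these toy numbers the selected and window terms are fixed costs, so short sequences sit below break-even (at 4K context, NSA attends to more positions than full attention does), while the ratio grows roughly linearly with sequence length beyond that point.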
Part 4: NSA in Context
The Lineage
TODO: How NSA synthesizes ideas from Sparse Transformer (factorized patterns), sliding window (reliable locality), MLA (compression), and MoE (learned routing).
Who's Using NSA-Style Attention
TODO: DeepSeek-V4's CSA/HCA, GLM-5's Lightning Indexer, Qwen 3 Next's hybrid design — all descendants of the NSA blueprint.
The Results
TODO: Benchmark comparisons, throughput numbers, quality vs. full attention.
Key Takeaways
TODO: When to use NSA-style attention. What's practical at smaller scales. The minimum sequence length where it pays off.