Building DeepSeek-V3 from the Ground Up
Part of the Paper × Code series — where we dissect papers and rebuild them from scratch.
Coming Soon. This post is currently being written. Check back soon for the full paper dissection with code implementations.
What We're Building
Paper: DeepSeek-V3 Technical Report
Authors: DeepSeek-AI (December 2024)
Link: arxiv.org/abs/2412.19437
The architecture at a glance:
- 671B total parameters, 37B active per token
- Multi-head Latent Attention (MLA) — compressed KV cache with absorbed projections
- DeepSeekMoE — fine-grained experts with auxiliary-loss-free load balancing
- Multi-Token Prediction (MTP) — predicting multiple future tokens per position
- FP8 mixed-precision training
We'll build each piece, understand why it exists, and see how they compose into the full model.
Part 1: Multi-Head Latent Attention (MLA)
The Compression Pathway
TODO: Joint KV compression into latent c_t. W_DKV down-projection. Why joint (not separate K, V compression).
The Absorption Trick — Full Derivation
TODO: Complete math showing how W_UK absorbs into W_Q. Dimensions at each step.
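To make the TODO concrete, here is the one-line core of the trick (single head, RoPE ignored, my own notation rather than the paper's exact symbols):

```latex
q_t^\top k_s
  = (W_Q h_t)^\top (W_{UK}\, c_s)
  = h_t^\top \underbrace{(W_Q^\top W_{UK})}_{\text{precomputed once}} c_s
```

Because the merged matrix can be formed offline, inference never materializes the per-head keys: queries are projected straight into the latent space and attention runs against the cached latents, which are far smaller than the full keys.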
The RoPE Problem
TODO: Why RoPE breaks absorption. Position-dependent rotation applied after projection means c_t can't carry position info through the absorbed path.
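A quick way to see the obstruction, in the same single-head simplification (with R_t the RoPE rotation matrix for position t; my notation, not the paper's):

```latex
q_t^\top k_s
  = (R_t W_Q h_t)^\top (R_s W_{UK}\, c_s)
  = h_t^\top\, W_Q^\top R_{s-t}\, W_{UK}\, c_s
```

using R_t^\top R_s = R_{s-t}. The middle factor now depends on the relative position s - t, so no single precomputed matrix can replace W_Q^\top R_{s-t} W_{UK}, which is exactly why DeepSeek moves RoPE onto a separate pathway.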
Decoupled RoPE Design
TODO: The two-pathway solution. Content keys (from latent, absorbed) + RoPE keys (small, cached separately). How the attention score combines both.
Code: MLA Layer
TODO: Full PyTorch implementation with comments.
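In the meantime, a minimal sketch with illustrative dimensions: joint KV compression plus the decoupled RoPE pathway, but no KV cache and no weight absorption, so both pathways stay easy to read (names like `SimpleMLA` are mine, not the paper's):

```python
import torch
import torch.nn as nn


def apply_rope(x, sin, cos):
    # Standard RoPE: rotate channel pairs by position-dependent angles.
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)


class SimpleMLA(nn.Module):
    """Simplified MLA block: joint KV latent + decoupled RoPE pathway."""

    def __init__(self, d_model=256, n_heads=4, d_latent=64, d_rope=16):
        super().__init__()
        self.n_heads, self.d_head, self.d_rope = n_heads, d_model // n_heads, d_rope
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)          # joint KV down-projection -> c_t
        self.w_uk = nn.Linear(d_latent, d_model, bias=False)           # latent -> content keys
        self.w_uv = nn.Linear(d_latent, d_model, bias=False)           # latent -> values
        self.w_q = nn.Linear(d_model, d_model, bias=False)             # content queries
        self.w_qr = nn.Linear(d_model, n_heads * d_rope, bias=False)   # per-head RoPE queries
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)             # one shared RoPE key per token
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        c_kv = self.w_dkv(x)  # at inference, only c_kv and the RoPE keys need caching
        k_c = self.w_uk(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q_c = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        # Decoupled RoPE pathway: position enters only through these small vectors.
        inv_freq = 10000 ** (-torch.arange(self.d_rope // 2) / (self.d_rope // 2))
        ang = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq
        sin, cos = ang.sin(), ang.cos()
        q_r = apply_rope(self.w_qr(x).view(b, t, self.n_heads, self.d_rope),
                         sin[:, None], cos[:, None]).transpose(1, 2)
        k_r = apply_rope(self.w_kr(x), sin, cos).unsqueeze(1).expand(-1, self.n_heads, -1, -1)

        # Score = content term (absorbable) + RoPE term (cached separately).
        scale = (self.d_head + self.d_rope) ** -0.5
        scores = (q_c @ k_c.transpose(-2, -1) + q_r @ k_r.transpose(-2, -1)) * scale
        scores = scores.masked_fill(torch.ones(t, t, dtype=torch.bool).triu(1), float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v
        return self.w_o(out.transpose(1, 2).reshape(b, t, -1))
```

Note how the score is a sum of two dot products: only the content term flows through the latent, while the small RoPE vectors carry all position information.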
Part 2: DeepSeekMoE
Why Fine-Grained Experts?
TODO: The argument for many small experts over few large ones. Combinatorial flexibility.
Shared Experts + Routed Experts
TODO: Always-active shared experts provide baseline capability. Top-k routed experts provide specialization.
Auxiliary-Loss-Free Load Balancing
TODO: The problem with auxiliary losses (they hurt model quality). DeepSeek's bias-term approach. Dynamic adjustment.
Code: MoE Layer with Routing
TODO: Full PyTorch implementation — router, expert dispatch, load balancing.
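Until then, a toy sketch of the routing logic: sigmoid affinities, bias-adjusted top-k selection, always-active shared experts, and a sign-based bias update. Dispatch is a naive Python loop rather than a real dispatch kernel, and all names and sizes are illustrative:

```python
import torch
import torch.nn as nn


class SimpleMoE(nn.Module):
    """Sketch of a DeepSeekMoE-style layer: shared + routed experts."""

    def __init__(self, d_model=32, d_ff=64, n_routed=8, n_shared=1, top_k=2):
        super().__init__()
        self.top_k = top_k

        def expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        # Per-expert bias used ONLY to pick the top-k, never in the gate values.
        self.register_buffer("route_bias", torch.zeros(n_routed))

    def forward(self, x):
        b, t, d = x.shape
        flat = x.reshape(-1, d)
        affinity = torch.sigmoid(self.router(flat))                  # token-expert affinities
        _, idx = torch.topk(affinity + self.route_bias, self.top_k, dim=-1)
        gates = torch.gather(affinity, -1, idx)                      # gates from *unbiased* scores
        gates = gates / gates.sum(-1, keepdim=True)
        out = sum(e(flat) for e in self.shared)                      # shared experts: always on
        for k in range(self.top_k):                                  # naive per-expert dispatch
            for e_id, e in enumerate(self.routed):
                sel = idx[:, k] == e_id
                if sel.any():
                    out[sel] = out[sel] + gates[sel, k].unsqueeze(-1) * e(flat[sel])
        self.last_idx = idx                                          # kept for the bias update
        return out.reshape(b, t, d)

    @torch.no_grad()
    def update_bias(self, gamma=1e-3):
        # Auxiliary-loss-free balancing: after each step, push the bias down
        # for overloaded experts and up for underloaded ones.
        load = torch.bincount(self.last_idx.flatten(), minlength=len(self.routed)).float()
        self.route_bias -= gamma * torch.sign(load - load.mean())
```

The key design point survives the simplification: the bias steers *which* experts are chosen, but the gradient-carrying gate weights come from the raw affinities, so balancing never distorts the training signal the way an auxiliary loss does.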
Part 3: Multi-Token Prediction (MTP)
The Idea
TODO: Predicting multiple future tokens per position during training. How it provides richer training signal.
Architecture for MTP
TODO: The sequential prediction modules. How they share the main model's representations.
MTP as Speculative Decoding
TODO: How MTP heads double as draft models during inference for free speculative decoding.
Code: MTP Head
TODO: Implementation of the MTP training and inference logic.
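A hedged sketch of the training side. `MTPModule` and `mtp_loss` are my own simplified constructions (LayerNorm instead of RMSNorm, a stock transformer block, illustrative sizes), but the wiring follows the idea above: depth k reuses the shared embedding table and output head and predicts the token k+1 steps ahead:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTPModule(nn.Module):
    """One MTP depth: merge the previous depth's states with look-ahead
    token embeddings, then run a small causal transformer block."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.norm_h, self.norm_e = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.merge = nn.Linear(2 * d_model, d_model, bias=False)
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)

    def forward(self, h_prev, tok_emb):
        h = self.merge(torch.cat((self.norm_h(h_prev), self.norm_e(tok_emb)), dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        return self.block(h, src_mask=mask)


def mtp_loss(h_main, tokens, embed, head, mtp_modules):
    """Average cross-entropy over MTP depths; `embed` and `head` are the
    main model's shared embedding table and output projection."""
    T, loss, h = tokens.size(1), 0.0, h_main
    for k, mod in enumerate(mtp_modules, start=1):
        n = T - 1 - k                                   # positions still valid at depth k
        h = mod(h[:, :n], embed(tokens[:, k:k + n]))    # feed the (t+k)-th token's embedding
        logits = head(h)
        target = tokens[:, k + 1:k + 1 + n]             # depth k predicts token t+k+1
        loss = loss + F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
    return loss / len(mtp_modules)
```

The sequence shrinks by one position per depth because each module needs one more look-ahead token; this averaged loss is then added to the main next-token loss with a weighting factor.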
Part 4: FP8 Mixed-Precision Training
Why FP8?
TODO: The memory and compute savings. Why DeepSeek went here when most were still on BF16.
The Fine-Grained Quantization Strategy
TODO: Tile-wise scaling, which operations stay in higher precision, handling of outliers.
Code: FP8 Training Utilities
TODO: Key quantization and dequantization routines.
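As a placeholder, a sketch of the scale bookkeeping only. Real FP8 training casts to `torch.float8_e4m3fn` inside custom GEMM kernels; this float32 simulation just shows how per-tile scaling works, and the function names are mine:

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def quantize_1x128(x, tile=128):
    """Per-(row, 128-column-block) scaling, as used for activations;
    weights get coarser 128x128 blocks in the paper's scheme."""
    rows, cols = x.shape
    assert cols % tile == 0, "pad columns to a multiple of the tile size"
    xt = x.reshape(rows, cols // tile, tile)
    # One scale per block keeps a single outlier from crushing the rest of the row.
    amax = xt.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    scale = amax / E4M3_MAX
    q = (xt / scale).clamp(-E4M3_MAX, E4M3_MAX)  # a real kernel would cast q to FP8 here
    return q, scale


def dequantize_1x128(q, scale):
    # Scales are applied back in higher precision (the FP32-accumulation side).
    return (q * scale).reshape(q.size(0), -1)
```

The fine granularity is the whole point: with one scale per 128 values, an activation outlier only degrades its own block instead of forcing the entire tensor onto a huge scale.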
Part 5: Putting It All Together
The Full Model Architecture
TODO: How MLA + MoE + MTP compose. The full forward pass.
Training Infrastructure
TODO: 2048 H800 GPUs, pipeline parallelism, expert parallelism, DualPipe.
The Results
TODO: Benchmark comparisons, serving efficiency numbers, cost analysis.
Key Takeaways
TODO: What practitioners should take away. What's reusable at smaller scales. What requires DeepSeek-level resources.