Papers · Gemma · Google DeepMind · Transformers · MoE · Paper × Code

Dissecting Gemma 4: Architecture from the Ground Up

May 13, 2026 · 2 min read

Part of the Paper × Code series — where we dissect papers and rebuild them from scratch.


Coming Soon. This post is currently being written. Check back soon for the full architectural dissection.


What We're Dissecting

Paper: Gemma 4 Technical Report
Authors: Google DeepMind (2025)

The architecture at a glance:

  • Multiple variants: E2B, E4B, 26B-A4B (MoE), 31B (dense)
  • Hybrid attention: sliding window (local) + full attention (global) layers
  • Mixture of Experts in the 26B-A4B variant
  • 256K context (larger variants)
  • Builds on Gemma 3's proven hybrid pattern with tighter windows

We'll take each component apart, understand why it's there, and see how the pieces compose into the full model.
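As a rough map of what we'll be rebuilding, here's a config skeleton for the four variants. Only facts from the list above are filled in; every other field is left empty, and the field names themselves are just working placeholders until we pin down the real numbers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GemmaVariantConfig:
    """Skeleton of the per-variant configs we'll fill in as we go.

    Anything set to None is a placeholder to be resolved in the
    corresponding part of this post.
    """
    name: str
    is_moe: bool                            # only 26B-A4B routes tokens through experts
    max_context: Optional[int] = None       # 256K on the larger variants
    sliding_window: Optional[int] = None    # local-attention window (Part 1)
    num_experts: Optional[int] = None       # MoE geometry (Part 2)
    experts_per_token: Optional[int] = None

VARIANTS = [
    GemmaVariantConfig("E2B",     is_moe=False),
    GemmaVariantConfig("E4B",     is_moe=False),
    GemmaVariantConfig("26B-A4B", is_moe=True,  max_context=262_144),
    GemmaVariantConfig("31B",     is_moe=False, max_context=262_144),
]
```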


Part 1: Hybrid Attention Design

Sliding Window Layers

TODO: Window sizes per variant (512 vs 1024). Why different scales need different windows. Rolling KV buffer implementation.
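As a preview of the mechanics: a sliding-window layer only ever attends over the last W positions, so its KV cache can be a fixed-size ring buffer instead of growing with sequence length. Here's a minimal sketch in PyTorch; the cache layout is illustrative and the window size is whatever you pass in, not the report's number.

```python
import torch

class RollingKVCache:
    """Fixed-size KV cache for a sliding-window attention layer.

    Once the buffer is full, the oldest position is overwritten, so memory
    stays O(window) no matter how long the sequence gets.
    """

    def __init__(self, window: int, num_heads: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, num_heads, head_dim)
        self.v = torch.zeros(window, num_heads, head_dim)
        self.pos = 0  # total tokens seen so far

    def append(self, k_t: torch.Tensor, v_t: torch.Tensor) -> None:
        """Write the current token's key/value into the ring buffer."""
        slot = self.pos % self.window
        self.k[slot] = k_t
        self.v[slot] = v_t
        self.pos += 1

    def window_view(self):
        """Return the keys/values visible to the current query, oldest first."""
        n = min(self.pos, self.window)
        if self.pos <= self.window:
            return self.k[:n], self.v[:n]
        # buffer has wrapped: unroll the ring so entries come back in temporal order
        oldest = self.pos % self.window
        order = (oldest + torch.arange(n)) % self.window
        return self.k[order], self.v[order]
```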

Global Attention Layers

TODO: Ratio of global to local. Which layers are global. How the "highway" pattern carries long-range information.
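The interleaving itself is easy to express. The sketch below generates a local/global layer plan; the 5:1 local-to-global ratio is borrowed from Gemma 3 as a stand-in until we confirm Gemma 4's actual pattern.

```python
def layer_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Interleave sliding-window ('local') and full-attention ('global') layers.

    Global layers act as the highway that carries information across the whole
    context; local layers keep compute and KV memory bounded.
    """
    pattern = []
    for i in range(num_layers):
        # every (local_per_global + 1)-th layer attends over the full context
        is_global = (i + 1) % (local_per_global + 1) == 0
        pattern.append("global" if is_global else "local")
    return pattern

# layer_pattern(12) -> ['local']*5 + ['global'] + ['local']*5 + ['global']
```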

The Evolution from Gemma 3

TODO: What changed from Gemma 3's 1:5 ratio. Why tighter windows work at this scale. The interplay with longer context (128K → 256K).


Part 2: Mixture of Experts (26B-A4B)

Router Design

TODO: How tokens get routed to experts. Top-k selection. Load balancing.
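For orientation, here's a bare-bones token-choice router: project each token onto expert logits, keep the top-k experts, and add a Switch-style auxiliary loss that discourages all tokens from collapsing onto a few experts. The expert count and k are placeholders, not Gemma 4's numbers.

```python
import torch
import torch.nn.functional as F

def route(x: torch.Tensor, router_w: torch.Tensor, k: int = 2):
    """Top-k token-choice routing with a Switch-style load-balancing loss.

    x        : (tokens, d_model) token representations
    router_w : (d_model, num_experts) router projection
    k        : experts activated per token (illustrative value)
    """
    logits = x @ router_w                          # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)   # chosen experts per token

    # Load-balancing loss: push the fraction of tokens whose top choice is
    # expert e toward the router's mean probability for expert e.
    num_experts = router_w.shape[1]
    token_frac = F.one_hot(topk_idx[..., 0], num_experts).float().mean(dim=0)
    prob_frac = probs.mean(dim=0)
    aux_loss = num_experts * (token_frac * prob_frac).sum()

    return topk_idx, topk_probs, aux_loss
```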

Expert Architecture

TODO: Number of experts, active experts per token. FFN structure within each expert.
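If Gemma 4 keeps the gated-GELU feed-forward block of earlier Gemma models (an assumption until we check the report), each expert is simply one such FFN with its own weights. A minimal sketch, with the hidden width left as a parameter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """A single expert: a gated (GeGLU-style) feed-forward block."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The gating path decides how much of each hidden unit passes through.
        return self.down(F.gelu(self.gate(x)) * self.up(x))
```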

When MoE Meets Hybrid Attention

TODO: How MoE layers interact with the hybrid attention pattern. Which layers are MoE vs dense.
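One way to picture the composition is as two independent axes per layer: attention type (local or global, from Part 1) and FFN type (dense or MoE). The toy plan below alternates MoE layers purely for illustration; which layers actually get experts is exactly what this section will pin down from the report.

```python
def block_plan(num_layers: int, attn_pattern: list[str], moe_every: int = 2):
    """Toy composition of the two axes: attention type x FFN type.

    attn_pattern : per-layer 'local'/'global' labels (e.g. from layer_pattern)
    moe_every    : illustrative only; the real dense/MoE placement is TBD
    """
    return [
        {
            "layer": i,
            "attention": attn_pattern[i],
            "ffn": "moe" if i % moe_every == 0 else "dense",
        }
        for i in range(num_layers)
    ]
```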


Part 3: The Dense Variant (31B)

Architectural Choices

TODO: How the 31B dense model differs from the MoE variant. Attention configuration. FFN sizing.

Performance Tradeoffs

TODO: When to choose dense vs MoE. Serving considerations. Quality comparisons.


Part 4: Training and Scaling

Training Infrastructure

TODO: Data, compute, training details from the report.

Scaling Decisions

TODO: Why these specific model sizes. The E2B/E4B variants for edge deployment.


Key Takeaways

TODO: What practitioners should take from Gemma 4's design. Reusable patterns at smaller scales. How it compares to DeepSeek-V4 and Qwen 3 Next.