Papers · Gemma · Google DeepMind · Transformers · MoE · Paper × Code

Dissecting Gemma 4: Architecture from the Ground Up

May 13, 2026 · 2 min read

Part of the Paper × Code series — where we dissect papers and rebuild them from scratch.


Coming Soon. This post is currently being written. Check back soon for the full architectural dissection.


What We're Dissecting

Paper: Gemma 4 Technical Report
Authors: Google DeepMind (2025)

The architecture at a glance:

  • Multiple variants: E2B, E4B, 26B-A4B (MoE), 31B (dense)
  • Hybrid attention: sliding window (local) + full attention (global) layers
  • Mixture of Experts in the 26B-A4B variant
  • 256K context (larger variants)
  • Builds on Gemma 3's proven hybrid pattern with tighter windows

We'll take each component apart, understand why it's there, and see how the pieces compose into the full model.
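As a rough map of what we'll be rebuilding, here's a config skeleton for the four variants. Only facts from the list above are filled in; every other field is left empty, and the field names themselves are just working placeholders until we pin down the real numbers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GemmaVariantConfig:
    """Skeleton of the per-variant configs we'll fill in as we go.

    Anything set to None is a placeholder to be resolved in the
    corresponding part of this post.
    """
    name: str
    is_moe: bool                            # only 26B-A4B routes tokens through experts
    max_context: Optional[int] = None       # 256K on the larger variants
    sliding_window: Optional[int] = None    # local-attention window (Part 1)
    num_experts: Optional[int] = None       # MoE geometry (Part 2)
    experts_per_token: Optional[int] = None

VARIANTS = [
    GemmaVariantConfig("E2B",     is_moe=False),
    GemmaVariantConfig("E4B",     is_moe=False),
    GemmaVariantConfig("26B-A4B", is_moe=True,  max_context=262_144),
    GemmaVariantConfig("31B",     is_moe=False, max_context=262_144),
]
```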


Part 1: Hybrid Attention Design

Sliding Window Layers

TODO: Window sizes per variant (512 vs 1024). Why different scales need different windows. Rolling KV buffer implementation.
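As a preview of the mechanics: a sliding-window layer only ever attends over the last W positions, so its KV cache can be a fixed-size ring buffer instead of growing with sequence length. Here's a minimal sketch in PyTorch; the cache layout is illustrative and the window size is whatever you pass in, not the report's number.

```python
import torch

class RollingKVCache:
    """Fixed-size KV cache for a sliding-window attention layer.

    Once the buffer is full, the oldest position is overwritten, so memory
    stays O(window) no matter how long the sequence gets.
    """

    def __init__(self, window: int, num_heads: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, num_heads, head_dim)
        self.v = torch.zeros(window, num_heads, head_dim)
        self.pos = 0  # total tokens seen so far

    def append(self, k_t: torch.Tensor, v_t: torch.Tensor) -> None:
        """Write the current token's key/value into the ring buffer."""
        slot = self.pos % self.window
        self.k[slot] = k_t
        self.v[slot] = v_t
        self.pos += 1

    def window_view(self):
        """Return the keys/values visible to the current query, oldest first."""
        n = min(self.pos, self.window)
        if self.pos <= self.window:
            return self.k[:n], self.v[:n]
        # buffer has wrapped: unroll the ring so entries come back in temporal order
        oldest = self.pos % self.window
        order = (oldest + torch.arange(n)) % self.window
        return self.k[order], self.v[order]
```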

Global Attention Layers

TODO: Ratio of global to local. Which layers are global. How the "highway" pattern carries long-range information.
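The interleaving itself is easy to express. The sketch below generates a local/global layer plan; the 5:1 local-to-global ratio is borrowed from Gemma 3 as a stand-in until we confirm Gemma 4's actual pattern.

```python
def layer_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Interleave sliding-window ('local') and full-attention ('global') layers.

    Global layers act as the highway that carries information across the whole
    context; local layers keep compute and KV memory bounded.
    """
    pattern = []
    for i in range(num_layers):
        # every (local_per_global + 1)-th layer attends over the full context
        is_global = (i + 1) % (local_per_global + 1) == 0
        pattern.append("global" if is_global else "local")
    return pattern

# layer_pattern(12) -> ['local']*5 + ['global'] + ['local']*5 + ['global']
```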

The Evolution from Gemma 3

TODO: What changed from Gemma 3's 1:5 ratio. Why tighter windows work at this scale. The interplay with longer context (128K → 256K).


Part 2: Mixture of Experts (26B-A4B)

Router Design

TODO: How tokens get routed to experts. Top-k selection. Load balancing.
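For orientation, here's a bare-bones token-choice router: project each token onto expert logits, keep the top-k experts, and add a Switch-style auxiliary loss that discourages all tokens from collapsing onto a few experts. The expert count and k are placeholders, not Gemma 4's numbers.

```python
import torch
import torch.nn.functional as F

def route(x: torch.Tensor, router_w: torch.Tensor, k: int = 2):
    """Top-k token-choice routing with a Switch-style load-balancing loss.

    x        : (tokens, d_model) token representations
    router_w : (d_model, num_experts) router projection
    k        : experts activated per token (illustrative value)
    """
    logits = x @ router_w                          # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)   # chosen experts per token

    # Load-balancing loss: push the fraction of tokens whose top choice is
    # expert e toward the router's mean probability for expert e.
    num_experts = router_w.shape[1]
    token_frac = F.one_hot(topk_idx[..., 0], num_experts).float().mean(dim=0)
    prob_frac = probs.mean(dim=0)
    aux_loss = num_experts * (token_frac * prob_frac).sum()

    return topk_idx, topk_probs, aux_loss
```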

Expert Architecture

TODO: Number of experts, active experts per token. FFN structure within each expert.
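If Gemma 4 keeps the gated-GELU feed-forward block of earlier Gemma models (an assumption until we check the report), each expert is simply one such FFN with its own weights. A minimal sketch, with the hidden width left as a parameter:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """A single expert: a gated (GeGLU-style) feed-forward block."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The gating path decides how much of each hidden unit passes through.
        return self.down(F.gelu(self.gate(x)) * self.up(x))
```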

When MoE Meets Hybrid Attention

TODO: How MoE layers interact with the hybrid attention pattern. Which layers are MoE vs dense.
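One way to picture the composition is as two independent axes per layer: attention type (local or global, from Part 1) and FFN type (dense or MoE). The toy plan below alternates MoE layers purely for illustration; which layers actually get experts is exactly what this section will pin down from the report.

```python
def block_plan(num_layers: int, attn_pattern: list[str], moe_every: int = 2):
    """Toy composition of the two axes: attention type x FFN type.

    attn_pattern : per-layer 'local'/'global' labels (e.g. from layer_pattern)
    moe_every    : illustrative only; the real dense/MoE placement is TBD
    """
    return [
        {
            "layer": i,
            "attention": attn_pattern[i],
            "ffn": "moe" if i % moe_every == 0 else "dense",
        }
        for i in range(num_layers)
    ]
```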


Part 3: The Dense Variant (31B)

Architectural Choices

TODO: How the 31B dense model differs from the MoE variant. Attention configuration. FFN sizing.

Performance Tradeoffs

TODO: When to choose dense vs MoE. Serving considerations. Quality comparisons.


Part 4: Training and Scaling

Training Infrastructure

TODO: Data, compute, training details from the report.

Scaling Decisions

TODO: Why these specific model sizes. The E2B/E4B variants for edge deployment.


Key Takeaways

TODO: What practitioners should take from Gemma 4's design. Reusable patterns at smaller scales. How it compares to DeepSeek-V4 and Qwen 3 Next.