Papers, Annotated

Reading notes for papers I've worked through — the diagrams I wished existed, the derivations I had to redo, and the parts that took me longest to understand.

Annotated

May 13, 2026

Gemma 4 hybrid attention — alternating sliding-window and global-attention layers stacked, with MoE expert clusters

Dissecting Gemma 4: Architecture from the Ground Up

Five local layers, then one global. The hybrid rhythm.

May 12, 2026

DeepSeek-V3 architecture — MLA with decoupled RoPE, DeepSeekMoE, Multi-Token Prediction, and FP8 training stacked as four modules

Building DeepSeek-V3 from the Ground Up

The Innovations in the Modern LLM

February 12, 2022

DeBERTa's disentangled attention — content and position as parallel input streams

DeBERTa is the New King

What if attention separated content from position?

February 12, 2022

Longformer's sparse attention pattern — sliding window diagonal plus a few global tokens as crossing rows and columns

Longformer: The Long-Document Transformer

Attention as a mostly-empty matrix.