Tanul Singh


Papers, Annotated

Reading notes for papers I've worked through — the diagrams I wished existed, the derivations I had to redo, and the parts that took me longest to understand.

Annotated

May 13, 2026
Figure: Gemma 4 hybrid attention (alternating sliding-window and global-attention layers stacked, with MoE expert clusters)
Dissecting Gemma 4: Architecture from the Ground Up

Five local layers, then one global. The hybrid rhythm.

May 12, 2026
Figure: DeepSeek-V3 architecture (MLA with decoupled RoPE, DeepSeekMoE, Multi-Token Prediction, and FP8 training stacked as four modules)
Building DeepSeek-V3 from Ground Up

The Innovations in the Modern LLM

February 12, 2022
Figure: DeBERTa's disentangled attention (content and position as parallel input streams)
DeBERTa is the New King

What if attention separated content from position?

February 12, 2022
Figure: Longformer's sparse attention pattern (sliding window diagonal plus a few global tokens as crossing rows and columns)
Longformer: The Long-Document Transformer

Attention as a mostly-empty matrix.

© 2026 Tanul Singh. Built with curiosity.