Why Every LLM Uses RoPE (Even When It Doesn't Extrapolate)
Vanilla RoPE doesn't extrapolate cleanly to sequences longer than its training context. ALiBi does. Yet every production LLM uses RoPE plus an extension trick like YaRN. This is the story of how I discovered RoPE's extrapolation problem, traced its consequences experimentally, and reconciled it with what production models actually do.