Beyond Attention: Anatomy of a Modern Transformer
Attention gets all the press, but the rest of the transformer matters just as much. This post traces how feed-forward networks, normalization, residual connections, activations, and embeddings evolved from the original 2017 design to the modern LLM recipe — and why seemingly small changes (removing a bias term, swapping an activation function) compound into models that train faster, scale further, and perform better.
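To preview the kind of "small change" the post means, here is a minimal PyTorch sketch contrasting an original-style feed-forward block (ReLU, bias terms) with a modern gated variant (SwiGLU, no biases). The module names and sizes are illustrative assumptions, not taken from any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OriginalFFN(nn.Module):
    """2017-style feed-forward block: two linear layers with biases and a ReLU."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=True)
        self.w2 = nn.Linear(d_ff, d_model, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.relu(self.w1(x)))

class SwiGLUFFN(nn.Module):
    """Modern-style feed-forward block: gated SiLU (SwiGLU) and no bias terms."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) elementwise-multiplied with (x W_up), then projected back down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Quick shape check with made-up sizes (batch, sequence, d_model)
x = torch.randn(2, 16, 512)
print(OriginalFFN(512, 2048)(x).shape)  # torch.Size([2, 16, 512])
print(SwiGLUFFN(512, 1365)(x).shape)    # torch.Size([2, 16, 512])
```

Note that the gated variant has three weight matrices instead of two, so in practice its hidden width is often shrunk to keep the parameter count roughly comparable.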