Part 3 of 3
The Gradient Descent through Transformers
Positional Encoding Part 2 — RoPE, ALiBi, and the Quest for Length Generalization
This is Part 3 of The Gradient Descent through Transformers — a series where I walk through every component of the modern transformer stack, how it evolved from 2017 to 2026, and why each piece matters.
Previously: Positional Encoding Part 1 — Why Transformers Need to Know Where Words Are
In the previous post, we saw how sinusoidal and learned positional encodings give transformers a sense of order by adding position vectors to the input. We also saw their fundamental weaknesses:
- Position information is added once and dilutes through layers
- The model must learn to extract relative distances — it's not explicit
- Extrapolation to longer sequences fails
All of these stem from the same root problem: absolute position is the wrong abstraction for language. When you read "the cat sat on the mat," you don't think "the word at position 1 relates to the word at position 2." You think "the subject is right before the verb." Relative position — how far apart things are — is what matters.
This post covers the three major approaches that moved position encoding from the input layer into the attention mechanism itself: T5's relative bias, ALiBi, and RoPE.
The Key Shift: From Input to Attention
In absolute methods, position information lives in the input:

$$x_i = \text{embedding}(\text{token}_i) + \text{PE}(i)$$
The attention computation then has to hope that position information survives through the Q, K projections and the dot product. It's indirect.
Relative methods take a fundamentally different approach: inject position information directly into the attention scores. Instead of modifying the input, they modify what the model pays attention to. The attention score between tokens $i$ and $j$ becomes a function of both their content AND their relative distance $i - j$:

$$\text{score}(i, j) = f(q_i, k_j, i - j)$$
This is a stronger inductive bias. The model doesn't have to learn that position matters — the architecture explicitly encodes it in every attention computation, at every layer.
T5's Relative Position Bias (Raffel et al., 2020)
T5 (Raffel et al., 2020) introduced one of the first successful relative position methods. The idea is elegantly simple.
How It Works
After computing the standard content-based attention scores $q_i \cdot k_j$, T5 adds a learned bias based on the relative distance between the query and key positions:

$$\text{score}(i, j) = q_i \cdot k_j + b_{i-j}$$

Where $b_{i-j}$ is a learned scalar looked up from a bias table indexed by the relative distance $i - j$.
That's it. No modification to Q, K, or V. No special embeddings. Just a learned number added to each attention score based on how far apart the two tokens are.
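To make this concrete, here's a minimal NumPy sketch of the pattern. It is not T5's actual implementation: the toy clipped lookup stands in for the log-bucketing described below, and the function name and shapes are my own framing.

```python
import numpy as np

def t5_biased_scores(Q, K, bias_table):
    """Content-based scores plus a learned scalar per (query, key) distance.

    Q, K: (seq_len, d_head) arrays. bias_table: 1D array of learned biases,
    indexed here by clipped distance (T5 really uses log-spaced buckets,
    and it famously omits the 1/sqrt(d) scaling; kept here for familiarity).
    """
    seq_len, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)            # standard content-based attention scores
    i = np.arange(seq_len)[:, None]          # query positions
    j = np.arange(seq_len)[None, :]          # key positions
    rel = i - j                              # relative distance (positive = looking back)
    idx = np.clip(rel + len(bias_table) // 2, 0, len(bias_table) - 1)
    return scores + bias_table[idx]          # one scalar added to each score

# toy usage: 8 tokens, 16-dim heads, biases covering distances -16..16
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print(t5_biased_scores(Q, K, bias_table=rng.normal(size=33)).shape)  # (8, 8)
```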
The Bias Table
The bias table maps relative distances to scalar values. For example:
| Relative distance | Learned bias |
|---|---|
| 0 (same position) | +2.1 |
| 1 (adjacent, looking back) | +1.8 |
| 2 | +1.2 |
| 5 | +0.4 |
| 10 | -0.1 |
| 50 | -0.8 |
The model typically learns positive biases for nearby tokens (encouraging local attention) and negative or near-zero biases for distant tokens.
Bucketing for Efficiency
A naive implementation would need a bias entry for every possible distance from $-(n-1)$ to $n-1$ for a length-$n$ sequence. For long sequences, that's impractical. T5 uses logarithmic bucketing: nearby distances get exact entries (distance 0, 1, 2, ... up to some threshold), while larger distances share buckets (distances 50-100 might all use the same bias value).
This reflects a linguistic intuition: the difference between "1 token away" and "2 tokens away" matters a lot. The difference between "50 tokens away" and "51 tokens away" barely matters.
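Here's a sketch of the bucketing idea, loosely modeled on the causal half of T5's scheme (the real `_relative_position_bucket` also handles the bidirectional encoder case, so treat the exact formula here as an approximation):

```python
import numpy as np

def relative_position_bucket(rel, num_buckets=32, max_distance=128):
    """Map relative distances to buckets: exact for small ones, logarithmic beyond."""
    n = np.abs(rel)
    max_exact = num_buckets // 2                   # first half: one bucket per distance
    n_safe = np.maximum(n, max_exact)              # only used on the large-distance branch
    # second half: log-spaced buckets from max_exact up to max_distance, then saturate
    log_bucket = max_exact + (
        np.log(n_safe / max_exact) / np.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).astype(int)
    log_bucket = np.minimum(log_bucket, num_buckets - 1)
    return np.where(n < max_exact, n, log_bucket)

for d in [0, 1, 2, 15, 50, 51, 500]:
    print(d, relative_position_bucket(np.array(d)))
# distances 0, 1, 2 each get their own bucket; 50 and 51 share one; 500 saturates
```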
Strengths
- Simple and intuitive: just add a number based on distance.
- Each attention head learns its own bias table: some heads can become "local" (strongly prefer nearby tokens) while others become "global" (uniform or long-range attention). This specialization happens naturally.
- Position information at every layer: unlike absolute PE which is added once at the input, the relative bias is applied at every attention layer. No information diffusion problem.
Limitations
- The bias table is finite: if the table covers distances up to 128, the model doesn't know what to do with distance 200. Extrapolation beyond the trained range is undefined.
- Extra parameters: each attention head has its own bias table. With 12 heads × 128 buckets, that's 1,536 learned parameters per layer — small but nonzero.
- Content and position are independent: the bias doesn't interact with the content. The model can't learn "attend to the nearest verb" — only "attend to nearby tokens" regardless of what they are.
ALiBi: Attention with Linear Biases (Press et al., 2022)
ALiBi (Press et al., 2022) asked: what if we don't even need to learn the position biases? What if a fixed, simple formula works just as well?
The Idea
Instead of a learned bias table, ALiBi adds a fixed linear penalty based on distance:

$$\text{score}(i, j) = q_i \cdot k_j - m \cdot (i - j)$$

Where $m$ is a fixed slope (not learned) that varies per attention head.
In words: the farther apart two tokens are, the more the attention score is penalized. The penalty grows linearly with distance. Each head has a different slope $m$, giving heads different "attention windows."
Head-Specific Slopes
The slopes are set using a geometric sequence. For 8 heads:

$$m = \frac{1}{2^1}, \frac{1}{2^2}, \frac{1}{2^3}, \dots, \frac{1}{2^8} = 0.5,\ 0.25,\ 0.125,\ \dots,\ 0.0039$$
Head 1 (slope 0.5): aggressive locality — attention drops off quickly with distance. This head focuses on immediate neighbors.
Head 8 (slope 0.0039): very gradual penalty — can attend to tokens hundreds of positions away. This head captures long-range dependencies.
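Here's a minimal NumPy sketch of how the full bias tensor can be built. The slope formula matches the paper's geometric recipe for power-of-two head counts; the function name and shapes are my own framing:

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    """Fixed per-head linear penalties: shape (num_heads, seq_len, seq_len)."""
    # geometric slopes 2^(-8h/num_heads): 0.5 down to 0.0039 for 8 heads
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]           # query positions
    j = np.arange(seq_len)[None, :]           # key positions
    distance = i - j                          # 0 on the diagonal, grows looking back
    return -slopes[:, None, None] * distance  # penalty grows linearly with distance

bias = alibi_bias(num_heads=8, seq_len=1024)
# added to Q @ K.T / sqrt(d) before the causal mask and softmax:
#   scores = Q @ K.T / np.sqrt(d) + bias[h]
print(bias[0, 5, 2], bias[7, 5, 2])  # head 0 penalizes distance 3 far more than head 7
```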
Why It Works Despite Being So Simple
ALiBi works because:
- Language is already local: most important dependencies in language are nearby. A subject is usually close to its verb. A pronoun is close to its referent. The linear penalty just formalizes this prior.
- Different heads handle different scales: the geometric slopes give the model both local and global attention in the same layer, without learning.
- No position encoding to dilute: since there's no additive PE in the input, the token embeddings are pure content vectors. Position lives entirely in the attention scores where it's used.
The Extrapolation Breakthrough
ALiBi's biggest contribution: it extrapolates to longer sequences than it was trained on. If you train a model with ALiBi on sequences of length 1024, it works reasonably well on sequences of length 2048 or 4096 at inference time.
Why? Because the linear penalty formula is defined for any distance. There's no table to run out of, no learned position embedding to be undefined. Position 2000 naturally gets a larger penalty from position 0 — the formula just works.
This was a major result. Previous methods (sinusoidal, learned, T5 bias) all degraded sharply beyond training length. ALiBi showed that extrapolation is possible with the right design.
Limitations
- Fixed, not learned: the slopes can't adapt to the task. If a specific task needs unusual attention patterns, ALiBi can't accommodate them.
- Linear assumption: the penalty is always linear with distance. Some tasks might benefit from more complex decay patterns (e.g., sharp local attention with a long flat tail for retrieval).
- Still absolute in disguise: the penalty depends on $i - j$, which is computed from absolute positions. It doesn't capture richer relative relationships (e.g., "same sentence" vs "different sentence").
RoPE: Rotary Position Embedding (Su et al., 2021)
RoPE (Su et al., 2021) is the method that won. It powers Llama, Mistral, Qwen, Gemma, PaLM, Phi, and virtually every major LLM from 2023 onward. Understanding RoPE deeply is essential.
The Core Insight
Remember from Part 1 that sinusoidal positional encoding has a rotation property: $\text{PE}(pos + k) = R_k \cdot \text{PE}(pos)$, where $R_k$ depends only on the offset $k$. The encoding for a shifted position is a rotation of the original encoding.
RoPE takes this idea and asks: what if instead of adding position to the input, we apply the rotation directly to the Query and Key vectors inside the attention computation?
How It Works
In standard attention, the dot product between query and key measures content similarity:

$$\text{score}(i, j) = q_i \cdot k_j$$

In RoPE, we rotate $q_i$ by position $i$ and rotate $k_j$ by position $j$ before taking the dot product:

$$\text{score}(i, j) = (R_i q_i) \cdot (R_j k_j)$$

Where $R_m$ is a rotation matrix determined by position $m$.
Why This Encodes Relative Position
Here's the beautiful part. The dot product of two rotated vectors has this property:

$$(R_i q) \cdot (R_j k) = q^\top R_{i-j}\, k$$

The dot product depends on $R_{i-j}$ — a rotation by the relative distance $i - j$. The absolute positions $i$ and $j$ disappear, and only their difference remains.
This means:
- Token at position 3 attending to token at position 7: rotation by offset 4
- Token at position 100 attending to token at position 104: same rotation by offset 4
- Identical relative position → identical effect on attention scores
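You can check this property numerically in a few lines of NumPy, with made-up values for $q$, $k$, and $\theta$:

```python
import numpy as np

def rot(angle):
    """2D rotation matrix for the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.5                    # one dimension pair's frequency (arbitrary)
q = np.array([1.2, -0.7])      # made-up 2D query and key
k = np.array([0.3, 0.9])

def rope_score(i, j):
    """Rotate q by position i, k by position j, then dot them."""
    return (rot(theta * i) @ q) @ (rot(theta * j) @ k)

print(rope_score(3, 7))    # offset -4
print(rope_score(10, 14))  # same offset -4 -> identical score
print(rope_score(3, 8))    # offset -5 -> different score
```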
The Rotation in Detail
RoPE operates on pairs of dimensions. For a $d$-dimensional vector, it groups dimensions into $d/2$ pairs and applies a 2D rotation to each pair.

For dimension pair $t$ (dims $2t$, $2t{+}1$), the rotation for position $m$ is:

$$\begin{pmatrix} x'_{2t} \\ x'_{2t+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_t) & -\sin(m\theta_t) \\ \sin(m\theta_t) & \cos(m\theta_t) \end{pmatrix} \begin{pmatrix} x_{2t} \\ x_{2t+1} \end{pmatrix}$$

Where $\theta_t = 10000^{-2t/d}$ — the same frequency formula as sinusoidal PE!
Each dimension pair gets rotated by a different amount (determined by position and frequency). Low-dimensional pairs rotate quickly (capturing local position). High-dimensional pairs rotate slowly (capturing global position). It's the same multi-frequency clock idea — but applied as a rotation to Q and K rather than an addition to the input.
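In code, the per-pair rotation looks roughly like this. It's a minimal sketch using the interleaved (even, odd) pair layout described above; production implementations such as Llama's typically use a half-split layout with precomputed cos/sin tables, but the math is equivalent:

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by pos * theta_t."""
    d = x.shape[-1]
    t = np.arange(d // 2)
    theta = base ** (-2.0 * t / d)     # one frequency per pair, as in sinusoidal PE
    angles = pos * theta               # each pair's rotation angle at this position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]          # split into (even, odd) pairs
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin    # standard 2D rotation, all pairs at once
    out[1::2] = x1 * sin + x2 * cos
    return out
```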
A Worked Example: RoPE on a 4D Vector
Let's trace through the full computation to make this concrete. Take a tiny model with $d = 4$ (so $d/2 = 2$ dimension pairs).
The sequence: ["The", "cat", "sat"] at positions 0, 1, 2.
Frequencies for each dimension pair ($\theta_t = 10000^{-2t/d}$):
- Pair 0 (dims 0,1): $\theta_0 = 10000^{0} = 1.0$ — fast
- Pair 1 (dims 2,3): $\theta_1 = 10000^{-1/2} = 0.01$ — slow
Say "cat" at position 1 has query vector after the Q projection.
Split into pairs:
- Pair 0:
- Pair 1:
Rotate each pair by :
Pair 0: angle = radian (57.3°)
Pair 1: angle = radian (0.57° — barely moves!)
Rotated query for "cat" at pos 1:
Notice: pair 0 got rotated significantly (57°), pair 1 barely changed (0.57°). This is the multi-speed clock in action.
Now do the same for every token:
| Token | Pos | Pair 0 angle | Pair 1 angle |
|---|---|---|---|
| "The" | 0 | 0.00 rad (no rotation) | 0.00 rad |
| "cat" | 1 | 1.00 rad | 0.01 rad |
| "sat" | 2 | 2.00 rad | 0.02 rad |
Every token's Q and K vectors undergo this same process — split into pairs, each pair rotated by its own angle based on position.
The attention computation:
When "cat" (pos 1) attends to "sat" (pos 2), the dot product of their rotated Q and K vectors is computed. In pair 0, the relative rotation is rad. In pair 1, it's rad.
If "cat" were at position 50 and "sat" at position 51, the relative rotations would be exactly the same: rad and rad. Same offset → same rotations → same dot product → same attention score.
This is how RoPE achieves true relative position encoding: every token gets rotated independently by its absolute position, but the attention mechanism only ever sees the difference in rotations — which is the relative distance.
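Reusing `apply_rope` from the sketch above, we can reproduce the walkthrough's numbers and confirm the shift invariance. The key vector below is made up, since the walkthrough only specified the query:

```python
import numpy as np
# assumes apply_rope() from the earlier sketch is in scope

q = np.array([1.0, 0.0, 1.0, 0.0])   # "cat"'s query from the walkthrough
k = np.array([0.5, 0.5, 0.0, 1.0])   # an arbitrary key for "sat"

print(apply_rope(q, 1).round(3))     # [0.54  0.841 1.    0.01 ] matches the walkthrough

s_near = apply_rope(q, 1) @ apply_rope(k, 2)    # "cat" at 1 attends to "sat" at 2
s_far = apply_rope(q, 50) @ apply_rope(k, 51)   # same tokens at positions 50 and 51
print(np.isclose(s_near, s_far))                # True: same offset, same score
```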
See It in Action
Now that you understand the rotation mechanics, play with the visualizer below. Remember: the full Q and K vectors have $d/2$ dimension pairs, each rotating at a different speed $\theta_t$. This visualizer shows what happens in one dimension pair — you can adjust $\theta$ to simulate different pairs (high $\theta$ = low dimension pair that rotates fast, low $\theta$ = high dimension pair that rotates slowly).
The key experiment: keep the offset (i−j) the same but change both absolute positions. Watch how the individual vectors rotate to completely different angles — but the dot product stays constant. That's RoPE: absolute positions change everything visually, but the attention score depends only on the relative distance.
[Interactive visualizer: RoPE — How Rotation Encodes Relative Position. High $\theta$ = low dimension pair (fast rotation, local position); low $\theta$ = high dimension pair (slow rotation, global position).]

Its default state, with $\theta = 0.5$, $i = 3$, $j = 7$:
- q rotated by $\theta \times i = 0.50 \times 3 = 1.50$ rad (85.9°)
- k rotated by $\theta \times j = 0.50 \times 7 = 3.50$ rad (200.5°)
- offset $(i - j) = -4$ → angle difference $= -2.00$ rad (−114.6°)

The rotation matrices applied:

$$R_3 = \begin{pmatrix} \cos(1.50) & -\sin(1.50) \\ \sin(1.50) & \cos(1.50) \end{pmatrix} = \begin{pmatrix} 0.071 & -0.997 \\ 0.997 & 0.071 \end{pmatrix}$$

$$R_7 = \begin{pmatrix} \cos(3.50) & -\sin(3.50) \\ \sin(3.50) & \cos(3.50) \end{pmatrix} = \begin{pmatrix} -0.936 & 0.351 \\ -0.351 & -0.936 \end{pmatrix}$$

The effective relative rotation is $R_{-4}$, with angle $\theta \times (i - j) = 0.50 \times (-4) = -2.00$ rad:

$$(R_i \cdot q) \cdot (R_j \cdot k) = q \cdot R_{i-j} \cdot k \quad \Rightarrow \quad \text{only the offset matters!}$$
Try this: set i=3, j=7 (offset = −4). Note the dot product. Now set i=10, j=14 (same offset = −4). The dot product is identical. Same relative distance → same attention score, regardless of absolute position.
Why RoPE Won
- Truly relative: the attention score depends only on $i - j$, not on absolute positions. This is the cleanest relative position encoding possible.
- Content-position interaction: unlike T5 bias or ALiBi, which add position independently of content, RoPE rotates the content vectors themselves. The attention score is a function of both content AND relative position simultaneously. The model can learn "attend to the nearest token that looks like a verb" — not just "attend to nearby tokens."
- No extra parameters: like sinusoidal PE, RoPE is a fixed mathematical transformation with zero learned parameters for position.
- Position at every layer: the rotation is applied to Q and K at every attention layer. No information diffusion problem — position is refreshed at every layer.
- Compatible with KV caching: during inference, once a key vector is computed and rotated for position $j$, it stays valid regardless of what future tokens appear. This makes autoregressive generation with a KV cache straightforward (see the sketch after this list).
- Length extension via interpolation: by scaling the frequencies $\theta_t$, you can extend the context window beyond training length. This led to techniques like NTK-aware interpolation and YaRN that extend Llama's 4K context to 100K+ with minimal quality loss.
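As a toy illustration of the KV-cache point (single head, attention logits only, no softmax or value vectors; all names here are my own):

```python
import numpy as np
# assumes apply_rope() from the earlier sketch is in scope

kv_cache = []   # each key is rotated exactly once, at its own position

def decode_step(q_new, k_new, pos):
    """One autoregressive step: cache the rotated key, score against all cached keys."""
    kv_cache.append(apply_rope(k_new, pos))   # stays valid forever: the rotation
                                              # depends only on this token's position
    q_rot = apply_rope(q_new, pos)
    return np.array([q_rot @ k for k in kv_cache])   # attention logits over the cache
```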
The Frequency Base: 10000 and Beyond
The choice of base 10000 determines how quickly each dimension pair rotates. Recent work (including Llama 3's "rope_theta" of 500000) has shown that increasing the base extends the context window — it makes all rotations slower, allowing the model to distinguish positions farther apart.
- Llama 2: base = 10,000, context = 4,096
- Llama 3: base = 500,000, context = 8K (extended to 128K in Llama 3.1)
This is a simple but powerful knob for scaling context length.
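One quick way to see the effect: compare how many positions the slowest-rotating pair needs for one full revolution under each base. A rough proxy, assuming a head dimension of 128:

```python
import numpy as np

def slowest_wavelength(d, base):
    """Positions needed for the slowest dimension pair to complete one full rotation."""
    theta_min = base ** (-2.0 * (d // 2 - 1) / d)   # smallest frequency = slowest pair
    return 2 * np.pi / theta_min

print(f"{slowest_wavelength(128, 10_000):,.0f}")    # ~54,000 positions
print(f"{slowest_wavelength(128, 500_000):,.0f}")   # ~2,600,000 positions
```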
Comparison: The Evolution
| Method | Year | Position in arch | Relative? | Extrapolation | Used by |
|---|---|---|---|---|---|
| Sinusoidal | 2017 | Input (additive) | Implicit | Poor | Original Transformer |
| Learned | 2018 | Input (additive) | No | None | BERT, GPT-2/3 |
| T5 Bias | 2020 | Attention scores | Yes (learned) | Limited | T5, Flan-T5 |
| ALiBi | 2022 | Attention scores | Yes (fixed) | Good | BLOOM, MPT |
| RoPE | 2021 | Q,K rotation | Yes (implicit) | Extensible | Llama, Mistral, Qwen, Gemma, PaLM, Phi |
The Clear Winner
RoPE dominates for a reason: it's the only method that simultaneously achieves:
- True relative position encoding
- Content-position interaction (not independent)
- Zero extra parameters
- Position information at every layer
- Compatibility with efficient inference (KV cache)
- Extensible context length via frequency scaling
The field isn't debating position encoding anymore. RoPE won. The remaining research is about how to extend its effective context length further (NTK interpolation, YaRN, LongRoPE) and how to combine it with other attention modifications.
References & Further Reading
- Attention Is All You Need (Vaswani et al., 2017) — The original transformer with sinusoidal positional encoding.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2020) — T5, which introduced relative position bias.
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization (Press et al., 2022) — ALiBi paper.
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021) — The RoPE paper.
- Extending Context Window of Large Language Models via Position Interpolation (Chen et al., 2023) — Position interpolation for extending RoPE.
- YaRN: Efficient Context Window Extension of Large Language Models (Peng et al., 2023) — Advanced RoPE extension.
- The Impact of Positional Encoding on Length Generalization in Transformers (Kazemnejad et al., 2023) — Systematic comparison of positional encoding methods.
Next in the series: Attention Mechanisms — from vanilla self-attention to Flash Attention, Grouped-Query Attention, and the quest to make attention scale.
This post is part of The Gradient Descent through Transformers — a series dissecting every component of the modern transformer stack.