
Why UTF-8 doesn't compress anything

June 6, 2026 · 2 min read

I was watching 3Blue1Brown's video on information theory when this hit me: the information content of an event with probability $p$ is $-\log_2(p)$. So if every basic English character in UTF-8 takes exactly 8 bits, we're treating every character as equally likely, and that means we're getting nothing back from a compression standpoint.

Let me work through it.

UTF-8 uses 1 byte (8 bits) for basic English characters. 8 bits can encode $2^8 = 256$ possible values. (Strictly, single-byte UTF-8 only uses the 128 values with the high bit clear, but the storage cost is still a full 8 bits.) If we say each of those 256 values appears with equal probability, then the probability of any one of them is $\frac{1}{256}$.

The information content per character:

$$I = -\log_2\left(\frac{1}{256}\right) = 8 \text{ bits}$$

The math matches the architecture exactly. When you assume a uniform distribution, every character carries exactly 8 bits of information, so you need exactly 8 bits of storage.
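Here's a quick way to sanity-check that claim. This is a minimal sketch, not anything official: the `letters` string and variable names are my own, and it only looks at the unaccented A–Z/a–z letters, which are the ones UTF-8 stores in a single byte.

```python
import math

# Bits of storage UTF-8 spends on each basic English letter.
letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
bits_per_letter = {ch: 8 * len(ch.encode("utf-8")) for ch in letters}

# Information content of one symbol drawn from 256 equally likely values.
uniform_bits = -math.log2(1 / 256)

# Every letter costs the same number of bits, and that number
# equals the uniform-distribution information content.
print(set(bits_per_letter.values()))
print(uniform_bits)
```

Characters outside that basic range (accented letters, emoji) take more than one byte in UTF-8, so this check is deliberately limited to plain English letters.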

But English isn't uniform. The letter E shows up about 12.7% of the time, the letter Z about 0.07%. Their actual information content:

  • E: $-\log_2(0.127) \approx 2.98$ bits
  • Z: $-\log_2(0.0007) \approx 10.48$ bits
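The same arithmetic as a small sketch, in case you want to plug in other letters. The function name `self_information` and the `freq` dict are my own choices; the two probabilities are just the ones quoted above.

```python
import math

def self_information(p: float) -> float:
    """Bits of information carried by an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Letter frequencies quoted in the text (as probabilities).
freq = {"E": 0.127, "Z": 0.0007}

for letter, p in freq.items():
    print(f"{letter}: {self_information(p):.2f} bits")
```

The rarer the letter, the more bits it carries, which is the whole asymmetry a fixed 8-bit slot ignores.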

UTF-8 gives E 8 bits of space even though it only carries ~3 bits of actual information. That's the waste. Entropy coders (Huffman coding, arithmetic coding, the Huffman stage inside ZIP's DEFLATE) abandon fixed lengths: they give common letters short codes and rare letters long codes. The average drops from 8 bits/character to roughly 4–5 bits/character, which is around the entropy of English letter frequencies taken one character at a time. Models that also exploit context between letters get lower still.
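To see the "short codes for common letters" idea in action, here's a rough sketch of Huffman code lengths built with Python's `heapq`. It's a toy under my own assumptions: the `sample` sentence is made up for illustration, the helper name `huffman_code_lengths` is mine, and it only tracks code lengths rather than emitting actual bit strings.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict[str, int]:
    """Return the Huffman code length (in bits) for each distinct character."""
    counts = Counter(text)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {char: depth in the tree so far}).
    heap = [(n, i, {ch: 0}) for i, (ch, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every char in them moves one level deeper.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical sample text, only for illustration.
sample = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
lengths = huffman_code_lengths(sample)
counts = Counter(sample)
avg_bits = sum(counts[ch] * lengths[ch] for ch in counts) / len(sample)

print(f"fixed width: 8.00 bits/char, Huffman: {avg_bits:.2f} bits/char")
```

A sentence this short has a tiny alphabet, so the average it prints will be lower than what you'd see on real English text, and it ignores the cost of storing the code table itself.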

The insight: uniform encoding is a worst-case strategy disguised as a neutral choice. Treating everything as equally likely is mathematically the same as throwing away every prior you have about the data. UTF-8 isn't "uncompressed by accident" — it's provably uncompressed because of the assumption baked into its design.

This is why Shannon's entropy is the floor: it's the lowest average number of bits per symbol that any lossless code can reach under the true distribution. UTF-8 sits at 8. English, modeled one letter at a time, sits at ~4–5. The gap is exactly the prior information UTF-8 is ignoring.
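If you want to measure that gap yourself, here's a small sketch that compares the bits UTF-8 spends per character with the entropy of a text's own character frequencies. The `sample` string and the function name `entropy_bits_per_char` are assumptions of mine, and the entropy here is the empirical one for this sample, not a measurement of English as a whole.

```python
import math
from collections import Counter

def entropy_bits_per_char(text: str) -> float:
    """Shannon entropy (bits/char) of the text's empirical character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical sample text, only for illustration.
sample = "it was the best of times it was the worst of times"

# Bits UTF-8 actually stores per character for this text.
stored = 8 * len(sample.encode("utf-8")) / len(sample)
h = entropy_bits_per_char(sample)

print(f"UTF-8: {stored:.2f} bits/char, entropy: {h:.2f} bits/char, gap: {stored - h:.2f}")
```

On a short sample the entropy comes out lower than the ~4–5 figure, because only a handful of distinct characters appear; a longer text gives a fairer estimate.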