Why UTF-8 doesn't compress anything
I was watching 3Blue1Brown's video on information theory when this hit me: the formula for information content is $I(x) = -\log_2 p(x)$. So if every English character in UTF-8 takes exactly 8 bits, we're treating every character as equally likely, and that means we're getting nothing back from the compression standpoint.
Let me work through it.
UTF-8 uses 1 byte (8 bits) for basic English characters. 8 bits encodes $2^8 = 256$ possible values. If we say each of those 256 values appears with equal probability, then the probability of any one of them is $p = 1/256$.
The information content per character:
$$I(x) = -\log_2 \frac{1}{256} = \log_2 256 = 8 \text{ bits}$$
The math matches the architecture exactly. When you assume a uniform distribution, every character carries exactly 8 bits of information, so you need exactly 8 bits of storage.
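The storage half of that claim is easy to check. Here's a minimal Python sketch using only the built-in `str.encode`; the helper name `utf8_bits` and the sample characters are my own picks for illustration:

```python
# How many bits does UTF-8 spend on each character?
# The sample characters are illustrative choices, not from any dataset.
def utf8_bits(ch: str) -> int:
    """Return the number of bits UTF-8 uses to store a single character."""
    return len(ch.encode("utf-8")) * 8

for ch in ["E", "z", " ", "é", "€"]:
    print(repr(ch), utf8_bits(ch), "bits")
```

Every ASCII character comes out at 8 bits no matter how common it is. Characters outside ASCII take 16 bits or more, but that extra length follows the code point, not how often the character shows up in text.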
But English isn't uniform. The letter E shows up about 12.7% of the time, the letter Z about 0.07%. Their actual information content:
E: $-\log_2(0.127) \approx 2.98$ bits
Z: $-\log_2(0.0007) \approx 10.48$ bits
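Here's the same arithmetic as a quick Python sketch. The function name `self_information` is mine, and the probabilities are just the rough letter frequencies quoted above, so treat the output as approximate:

```python
import math

def self_information(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Uniform over 256 byte values, then the rough English letter
# frequencies from the text (12.7% for E, 0.07% for Z).
for label, p in [("uniform byte", 1 / 256), ("E", 0.127), ("Z", 0.0007)]:
    print(f"{label}: {self_information(p):.2f} bits")
```

Notice that Z lands above 8 bits. A rare letter carries more information than a byte's worth, which is why a variable-length code can afford to give it a longer codeword.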
UTF-8 gives E 8 bits of space even though it only carries ~3 bits of actual information. That's the waste. Real compression algorithms (Huffman coding, arithmetic coding, the DEFLATE method that ZIP uses) abandon fixed lengths: they give common letters short codes and rare letters long codes. The average drops from 8 bits/character to roughly 4–5 bits/character, which is around the entropy of English when each letter is treated independently. Models that also exploit context between letters can go lower still.
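To see the variable-length idea in action, here's a rough Python sketch of Huffman coding. It builds code lengths from the character counts of a sample sentence I made up (not real English letter statistics), and the function names are my own:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict[str, int]:
    """Return the Huffman codeword length (in bits) for each symbol in text."""
    if not text:
        raise ValueError("text must be non-empty")
    counts = Counter(text)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (total count, tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def average_bits(text: str) -> float:
    """Average Huffman bits per character, using this text's own frequencies."""
    lengths = huffman_code_lengths(text)
    counts = Counter(text)
    return sum(counts[s] * lengths[s] for s in counts) / len(text)

# Hypothetical sample sentence; any English text would do.
sample = "the quick brown fox jumps over the lazy dog and then rests in the sun"
print(f"Huffman: {average_bits(sample):.2f} bits/char vs 8 for UTF-8")
```

This only counts the codewords themselves. A real compressor also has to store the code table, so on a sentence this short the savings would be smaller than the number suggests.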
The insight: uniform encoding is a worst-case strategy disguised as a neutral choice. Treating everything as equally likely is mathematically the same as throwing away every prior you have about the data. UTF-8 isn't "uncompressed by accident" — it's provably uncompressed because of the assumption baked into its design.
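This is why Shannon's entropy is the floor: $H = -\sum_x p(x) \log_2 p(x)$ is the smallest average number of bits per symbol that any lossless code can reach when symbols are drawn independently from the true distribution. UTF-8 sits at 8 bits for each English character. English letter frequencies put the floor at ~4–5. The gap is exactly the prior information UTF-8 is ignoring.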
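If you want to measure that gap yourself, here's a small Python sketch. It estimates entropy from the character counts of a sample sentence I made up, so the number it prints is a rough per-character estimate for that sentence, not a measurement of English as a whole. The function name is my own:

```python
import math
from collections import Counter

def entropy_bits_per_char(text: str) -> float:
    """Shannon entropy (bits/char) of the text's empirical character distribution."""
    if not text:
        raise ValueError("text must be non-empty")
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())

# Hypothetical sample sentence; swap in any text you like.
sample = "the quick brown fox jumps over the lazy dog and then rests in the sun"

# Actual UTF-8 cost: encoded bytes times 8, divided by the character count.
utf8_cost = 8 * len(sample.encode("utf-8")) / len(sample)
floor = entropy_bits_per_char(sample)

print(f"UTF-8:   {utf8_cost:.2f} bits/char")
print(f"Entropy: {floor:.2f} bits/char")
print(f"Gap:     {utf8_cost - floor:.2f} bits/char")
```

The estimate treats each character as independent of its neighbors, the same assumption behind the letter-frequency numbers in this post.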