Imagine a cosmic library holding every possible sentence, from the profound “Artificial intelligence will reshape the future” to the absurd “Cat pillow jumps blue.” Popular sentences sit on prominent shelves, easily found with a short note like “Shelf 3, Book 5.” Random gibberish hides in dusty basements, requiring a long, word-for-word map. GPT, our cosmic librarian, navigates this library with uncanny precision, compressing texts into compact codes that can be perfectly restored. How does it work, and why is this a game-changer for data compression?
The Library of Language
In this infinite library, each sentence has a “popularity” score—its probability based on grammar, meaning, and context. GPT’s world model, trained on vast texts, assigns high probabilities to meaningful sentences, making them easier to locate. For example, “Artificial intelligence will reshape the future” is a bestseller, while “Cat pillow jumps blue” is obscure. Compression is about encoding these locations efficiently. How might GPT’s understanding of language make this possible?
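To make the "popularity score" concrete, here is a minimal sketch with invented numbers standing in for GPT's predictions: a sentence's probability is, by the chain rule, the product of each token's conditional probability given everything before it.

```python
import math

# Hypothetical conditional probabilities a GPT-like model might assign,
# P(token | all previous tokens) — invented for illustration only.
bestseller = [0.05, 0.8, 0.6, 0.3]      # "Artificial intelligence will reshape ..."
obscure    = [0.01, 1e-5, 1e-6, 1e-6]   # "Cat pillow jumps blue"

def sentence_prob(cond_probs):
    """Chain rule: P(sentence) = product of P(token_i | tokens before i)."""
    return math.prod(cond_probs)

print(sentence_prob(bestseller))  # ~0.007: a prominent shelf
print(sentence_prob(obscure))     # ~1e-19: deep in the dusty basement
```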
Arithmetic Coding: The Magic Wand
GPT teams up with arithmetic coding to turn sentences into numbers. Here’s how it compresses “Artificial intelligence will reshape…” (tokenized as “Artificial,” “intelligence,” “will,” …):
- Start with [0.0, 1.0]: The unit interval represents the space of all possible sequences.
- Encode “Artificial”: GPT predicts a 5% chance (P=0.05) that this word is the first token, so the interval shrinks to this token’s slice of width 0.05, here [0.0, 0.05].
- Encode “intelligence”: Given “Artificial,” GPT predicts an 80% chance (P=0.8), narrowing to [0.0, 0.04].
- Continue: Each token shrinks the interval further, ending with a tiny range, say [0.02113, 0.02114].
- Output: Pick a number inside the final interval, such as 0.02113, and write it in binary (≈ 0.0000010101…); those bits are the compressed result of the processed sentence.
Decompression reverses this, using the same GPT model to retrace the intervals and reconstruct the exact text. Why does this ensure no data is lost?
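Here is a minimal, runnable sketch of the whole round trip. A hand-written table (FAKE_MODEL) stands in for GPT's next-token predictions, and floating-point intervals stand in for the integer arithmetic a production coder would use; both are simplifications for illustration, not the actual implementation.

```python
# Toy arithmetic coder driven by a stand-in "language model".
# FAKE_MODEL is a hypothetical, hand-written table; in the real scheme GPT
# supplies these distributions. Floats keep the demo short — production
# arithmetic coders use integer intervals with renormalization.

FAKE_MODEL = {
    (): {"Artificial": 0.05, "Cat": 0.01, "The": 0.94},
    ("Artificial",): {"intelligence": 0.8, "flowers": 0.2},
    ("Artificial", "intelligence"): {"will": 0.6, "is": 0.4},
    ("Artificial", "intelligence", "will"): {"reshape": 0.3, "help": 0.7},
}

def next_token_probs(context):
    """Stand-in for GPT: P(next token | context)."""
    return FAKE_MODEL[tuple(context)]

def encode(tokens):
    """Shrink [0, 1) once per token; return the final interval."""
    low, high = 0.0, 1.0
    for i, token in enumerate(tokens):
        probs = next_token_probs(tokens[:i])
        width = high - low
        cum = 0.0
        for candidate, p in probs.items():
            if candidate == token:
                low, high = low + width * cum, low + width * (cum + p)
                break
            cum += p
    return low, high

def decode(value, n_tokens):
    """Retrace the same intervals with the same model to recover the text.
    A real coder would use an end-of-text token instead of passing n_tokens."""
    tokens = []
    low, high = 0.0, 1.0
    for _ in range(n_tokens):
        probs = next_token_probs(tokens)
        width = high - low
        cum = 0.0
        for candidate, p in probs.items():
            lo, hi = low + width * cum, low + width * (cum + p)
            if lo <= value < hi:
                tokens.append(candidate)
                low, high = lo, hi
                break
            cum += p
    return tokens

sentence = ["Artificial", "intelligence", "will", "reshape"]
low, high = encode(sentence)
print(low, high)                                 # roughly [0.0, 0.0072): a tiny slice of [0, 1)
print(decode((low + high) / 2, len(sentence)))   # ['Artificial', 'intelligence', 'will', 'reshape']
```

Because the decoder consults the identical model at every step, it sees exactly the same sub-intervals the encoder did, so the text comes back unchanged.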
Information Theory: Why Predictability Saves Space
Information theory reveals why this works. A word’s information content is -log₂(P(x)): high-probability words carry little information, while rare words carry more. Predictable sentences, rich in semantic patterns, occupy larger intervals on the number line and need fewer bits to pin down. Why might random text, like white noise, resist compression?
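A quick back-of-the-envelope check, reusing the same invented probabilities as the sketches above: summing each token’s -log₂(P) gives the total bit cost, and it equals -log₂ of the width of the final arithmetic-coding interval.

```python
import math

def info_bits(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

print(info_bits(0.8))     # ~0.32 bits: a highly predictable token is almost free
print(info_bits(0.0001))  # ~13.3 bits: a surprising token is expensive

# Cost of a whole sequence = sum of per-token information,
# which equals -log2 of the final arithmetic-coding interval's width.
token_probs = [0.05, 0.8, 0.6, 0.3]            # hypothetical GPT predictions
total_bits = sum(info_bits(p) for p in token_probs)
interval_width = math.prod(token_probs)
print(total_bits, -math.log2(interval_width))  # both ~7.12 bits
```

Random text gives every token a tiny probability, so each one costs many bits and the total never drops below the original size, which is why white noise resists compression.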
Why It Matters
This approach could revolutionize data storage and transmission, from archiving logs to sending messages across galaxies. But what challenges might arise in real-world applications? How could GPT’s predictive power evolve with better models?
Original post: https://liweinlp.com/13277