Demystifying the misconception of "Lossless Compression as Intelligence"

Debates on LLM compression theory reveal persistent misconceptions. Crucially, compression lies at the heart of the LLM revolution—illuminating its divine spark. Time for some clarification.

There are two interconnected core issues in the explosive growth of contemporary generative AI and large models that are most worth understanding thoroughly—otherwise, we're essentially allowing ourselves to live in medieval darkness. The first is how sequence learning unlocked universal tasks, making artificial general intelligence possible and transforming AGI from fringe science or science fiction into reality. I've written several blog posts attempting to explain this, though I'm not entirely certain I've conveyed it accurately. The second is the compression theory underlying large model intelligence. This issue has only recently become clear to me, with its context and principles finally falling into place. I feel it's worth sharing these insights.

A critical myth persists:  "Intelligence equals lossless compression" or "Lossless compression begets intelligence."

Both are false.

Compression produces intelligence—that's correct. But it's definitely not lossless compression that produces intelligence.

There's a cognitive error at play: many people conflate intelligent compression during training (lossy abstraction) with technical compression for specific applications (lossless encoding/decoding).

Compression has two distinct meanings: first, extracting every drop of insight and regularity from data, approaching theoretical optimality (K-complexity)—this is compression's true definition and intelligence's manifestation. Second, lossless compression, which requires perfect restoration of original data. Lossless compression/restoration isn't a genuine intelligence goal; at most, it's an application requirement (such as in archiving or transmission scenarios).

Lossless compression directly serves lossless restoration, where the lossless standard demands 100% restoration of input information in output (bit-level, including imperfections). Clearly, without resorting to form (instead of meaning), losslessness is ill-defined or meaningless. This differs from ultimate information compression, which targets meaning instead of form.  Semantic space embodies the true essence of teh statement "compression equals intelligence." Recognizing this distinction is key to dispelling the myth.

GPT, as a general intelligence agent, derives its core value from creative "distortion" in generation tasks (most application scenarios such as creative writing), while lossless compression is merely a technical byproduct (few application scenarios, such as for storage and transmission), and this capability only weakly correlates with intelligence level (when involving compression ratio, but unrelated to lossless restoration goals).

Attempting to prove model intelligence through lossless compression capability is inappropriate—like measuring legislators' competence by clerks' shorthand speed. These represent fundamentally different pathways:

Intelligent compression pursues minimal causal generation rules (K-complexity pathway), requiring active payment of "abstraction tax"; lossless compression pursues data restoration fidelity, leading to sacrificed model simplicity.

GPT's revolutionary nature lies in the former; the latter is merely a technical byproduct. In mainstream scenarios like generation and reasoning, creativity (or creative distortion) truly represents intelligence's brilliance, though its side effect of hallucination becomes large models' inherent challenge in some specific task scenarios (e.g. summarization, translation).

GPT uses next token prediction as its autoregressive training objective, seemingly a type of formal compression since the next token is its gold standard. But in implementation, it's unmistakably a semantic compression. At the micro level, next token prediction accuracy isn't measured by whether model output tokens match gold standards at the formal level, but through cross-entropy of internal token representations, measuring alignment between output and gold standards in semantic space. At the macro level, GPT trains on big data as a whole, not just targeting individual data points (a passage, a song, or an image). Lossless compression/restoration has clear definition for individual data points (100% formal restoration), but facing big data, this definition becomes impractical (unless for original data storage). In other words, big data compression determines it can only be semantic-level compression, mining the regularity behind big data.

Regarding GPT-enabled lossless restoration applications, large models' theoretical foundation of Kolmogorov complexity (K-complexity) supports the "lossy training-lossless application" framework. K-complexity pursues minimal generation programs, not data restoration capability. During training, lossy compression is the only path to approach K-complexity; during application, lossless restoration benefits from GPT's regularity to achieve unprecedented high compression ratios, as hes been verified by a number of researchers.

Actually, the famous scaling law for large model training emerges from this principle. This empirical observation and insight demonstrate that loss is necessary for intelligence: data must far exceed model size for intelligence improvement (otherwise the model "cheats" by memorizing and overfitting in vast parameters rather than continuously compressing and generalizing).

From another perspective, lossless restoration is an algorithmic property, not directly related to K-complexity. In fact, lossless restoration experiments show algorithms can always achieve lossless goals. Essentially: lossless restoration = model + delta. This delta represents the "abstraction tax" paid by the model—details the model didn't remember and needn't remember. In practice, powerful models yield smaller deltas; weaker models yield larger deltas. Lossless compression algorithms simply play this game. During application, model quality affects efficiency (compression ratio) but doesn't affect losslessness. Delta equals zero means the model remembered every detail, requiring the model to approach infinity or euipped with external massive storage. The other extreme is an infinitely small model or no model, degenerating the system into pure storage (hard disk). Disregarding compression ratio: white noise's K(x)≈|x| can still be precisely restored using lossless compression (like ZIP).

In textbooks, K-complexity is defined as a measure of data's intrinsic structure—"the length of the shortest program that outputs the string"—uncomputable in theory. Lossless compression is viewed as an engineering implementation for precise data restoration. Large models' emergence, verified by multiple scholars, indeed dramatically improves lossless compression ratios but doesn't change lossless compression's nature as merely an engineering tool. Of course, dramatically improved compression ratios also indicate that large models grasp data distribution regularity to unprecedented heights. However, regarding complexity theory, lossless compression/restoration often misleads. But high compression ratios during lossless restoration indeed strongly evidence large models' high intelligence, as no other knowledge system can surpass them.

Additionally, this topic has a crucial temporal dimension. Compression targets historical data, while predictive applications point toward the future (models as prophets), yet restoration only refers to historical data. This means even if lossless compression/restoration achieves ultimate compression ratios, it remains a distance from true predictive capability because there's a temporal wall between them. Crucially, intelligence's essence favors future prediction over historical restoration. Future prediction requires space for random sampling, but historical restoration precisely kills this beneficial randomness.

 

 

发布者

立委

立委博士,多模态大模型应用高级咨询。出门问问大模型团队前工程副总裁,聚焦大模型及其AIGC应用。Netbase前首席科学家10年,期间指挥研发了18种语言的理解和应用系统,鲁棒、线速,scale up to 社会媒体大数据,语义落地到舆情挖掘产品,成为美国NLP工业落地的领跑者。Cymfony前研发副总八年,曾荣获第一届问答系统第一名(TREC-8 QA Track),并赢得17个小企业创新研究的信息抽取项目(PI for 17 SBIRs)。

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注

这个站点使用 Akismet 来减少垃圾评论。了解你的评论数据如何被处理